Abstract
De-identifying clinical notes is crucial for the reuse of electronic clinical data and is commonly framed as a Named Entity Recognition (NER) task. Neural language models substantially improve performance on Natural Language Processing (NLP) tasks such as NER when they are combined with neural network architectures. This paper evaluates a current state-of-the-art deep learning method (Bi-LSTM-CRF) on the task of identifying patient names in clinical notes for de-identification purposes. We used two corpora and three language models to determine which combination delivers the best performance. In our experiments, the corpus built specifically for de-identifying clinical notes, paired with a contextualized embedding combined with word embeddings, achieved the best result: an F-measure of 0.94.
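To make the reported metric concrete, the sketch below shows one common way to compute an F-measure for name recognition from BIO-tagged tokens. It is an illustration only: the abstract does not state the evaluation protocol, so exact-match, entity-level scoring is an assumption here, and the `PER` label and the function names `extract_name_spans` and `f_measure` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start token index, end token index, exclusive)


def extract_name_spans(tags: List[str], label: str = "PER") -> Set[Span]:
    """Collect the token spans of one entity label from a BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag == f"I-{label}" and start is not None
        if start is not None and not continues:
            spans.add((start, i))
            start = None
        if tag == f"B-{label}":
            start = i
    return spans


def f_measure(gold: List[str], pred: List[str],
              label: str = "PER") -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact span matches."""
    gold_spans = extract_name_spans(gold, label)
    pred_spans = extract_name_spans(pred, label)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (not data from the paper): one of two names is found exactly.
gold_tags = ["O", "B-PER", "I-PER", "O", "B-PER", "O"]
pred_tags = ["O", "B-PER", "I-PER", "O", "O", "B-PER"]
precision, recall, f1 = f_measure(gold_tags, pred_tags)
```

Under exact-match scoring a partially recovered name counts as both a false positive and a false negative, which is a strict choice but a sensible one for de-identification, where a leaked surname is still a leak.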
This work was partially supported by the Institute of Artificial Intelligence in Healthcare, Memed, Google Latin America Research Awards, and by FCT under the project UIDB/00057/2020 (Portugal).
Acknowledgments
We thank Dr. Ana Helena D. P. S. Ulbrich, who provided the clinical notes dataset from the hospital, for her valuable cooperation. We also thank the volunteers of the Institute of Artificial Intelligence in Healthcare, Celso Pereira and Ana Lúcia Dias, for annotating the dataset.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Santos, J., dos Santos, H.D.P., Tabalipa, F., Vieira, R. (2021). De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_3
DOI: https://doi.org/10.1007/978-3-030-91699-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2
eBook Packages: Computer Science (R0)