De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier

Abstract


De-identification of clinical notes is crucial for the reuse of electronic clinical data and is commonly framed as a Named Entity Recognition (NER) task. Neural language models substantially improve performance on Natural Language Processing (NLP) tasks such as NER when they are integrated with neural network methods. This paper evaluates current state-of-the-art deep learning methods (Bi-LSTM-CRF) on the task of identifying patient names in clinical notes for de-identification purposes. We used two corpora and three language models to determine which combination delivers the best performance. In our experiments, the corpus specific to the de-identification of clinical notes, paired with a contextualized embedding combined with word embeddings, achieved the best result: an F-measure of 0.94.
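As a rough illustration of the task and metric named above, the following minimal sketch assumes that a token classifier has already produced BIO tags for each token. It masks the tagged names and computes an entity-level, exact-match F-measure. The tag label `PER`, the placeholder `[PATIENT]`, and the function names are hypothetical, and the abstract does not state whether the reported F-measure is token-level or entity-level. This is not the authors' Bi-LSTM-CRF implementation.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, from BIO tags."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def f_measure(gold: List[str], pred: List[str]) -> float:
    """Entity-level F-measure: a span counts only on an exact match (assumption)."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mask_names(tokens: List[str], tags: List[str],
               placeholder: str = "[PATIENT]") -> List[str]:
    """Replace each tagged name span with one placeholder (assumes well-formed BIO)."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            out.append(placeholder)
        elif not tag.startswith("I-"):
            out.append(token)
    return out
```

For example, with the tokens `["Patient", "Maria", "Silva", "returned", "today"]` and the tags `["O", "B-PER", "I-PER", "O", "O"]`, `mask_names` would replace the two name tokens with a single placeholder.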
Keywords: De-identification, Clinical notes, Language model, Token classifier
Published
29/11/2021
SANTOS, Joaquim; SANTOS, Henrique D. P. dos; TABALIPA, Fábio; VIEIRA, Renata. De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 10., 2021, Online. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. ISSN 2643-6264.