De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier

Joaquim Santos; Henrique D. P. dos Santos; Fábio Tabalipa; Renata Vieira

Joaquim Santos University of Évora http://orcid.org/0000-0002-0581-4092
Henrique D. P. dos Santos noharm.ai https://orcid.org/0000-0002-2410-3536
Fábio Tabalipa Memed https://orcid.org/0000-0002-2060-953X
Renata Vieira University of Évora http://orcid.org/0000-0003-2449-5477

Abstract

The de-identification of clinical notes is crucial for the reuse of electronic clinical data and is commonly framed as a Named Entity Recognition (NER) task. Neural language models substantially improve Natural Language Processing (NLP) tasks such as NER when they are integrated with neural network methods. This paper evaluates a current state-of-the-art deep learning method (Bi-LSTM-CRF) on the task of identifying patient names in clinical notes for de-identification purposes. We used two corpora and three language models to evaluate which combination delivers the best performance. In our experiments, the corpus specific to the de-identification of clinical notes, combined with a contextualized embedding and word embeddings, achieved the best result: an F-measure of 0.94.

Keywords: De-identification, Clinical notes, Language model, Token classifier

Springer (English)

Published

29/11/2021

How to Cite

SANTOS, Joaquim; SANTOS, Henrique D. P. dos; TABALIPA, Fábio; VIEIRA, Renata. De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 10., 2021, Online. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. ISSN 2643-6264.