Semantic Textual Similarity for Abridging Clinical Notes in Brazilian Electronic Health Records

Lucas T. Bandeira; Bernardo S. Consoli; Renata Vieira; Rafael H. Bordini

doi:10.5753/stil.2023.234200

Lucas T. Bandeira PUC-RS https://orcid.org/0009-0004-9919-0904
Bernardo S. Consoli PUC-RS https://orcid.org/0000-0003-0656-511X
Renata Vieira University of Évora https://orcid.org/0000-0003-2449-5477
Rafael H. Bordini PUC-RS https://orcid.org/0000-0001-8688-9901

Abstract

Information from electronic patient records is increasingly important in the development of machine learning models, which calls for a holistic understanding of those records. In particular, clinical notes need to be abridged so that the important information reaches the training process without the repetition commonly found in such notes. This paper presents the pre-processing of clinical notes from the BRATECA dataset, a Brazilian tertiary care data collection. The goal is to remove repeated information that results from the interactions between healthcare providers and patients, based on semantic similarity values assigned to pairs of sentences in the clinical notes.

Keywords: Healthcare, Semantic Similarity, Electronic Patient Records, Abridging Information
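
The abridging step described in the abstract can be illustrated with a minimal sketch: a sentence is kept only if its similarity to every previously kept sentence stays below a threshold. The abstract does not specify the similarity model or the cut-off, so everything here is an assumption made for illustration: the function names, the 0.8 threshold, and the token-overlap (Jaccard) score, which is only a lexical stand-in for a model-assigned semantic similarity value.

```python
import re
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Toy lexical stand-in for a semantic similarity score in [0, 1].

    Assumption: a real pipeline would plug in a model-based score instead.
    """
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def abridge_notes(
    sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep a sentence only if it is not too similar to any sentence already kept."""
    kept: List[str] = []
    for sentence in sentences:
        if all(similarity(sentence, prior) < threshold for prior in kept):
            kept.append(sentence)
    return kept


# Invented example sentences; they are not drawn from BRATECA.
notes = [
    "Paciente refere dor abdominal há dois dias.",
    "Paciente refere dor abdominal há dois dias.",
    "Paciente refere dor abdominal há dois dias!",
    "Prescrito dipirona 500 mg.",
]
abridged = abridge_notes(notes)
```

Because `similarity` is passed in as a parameter, the lexical score can be swapped for any function that returns a value in [0, 1] for a sentence pair without changing the filtering loop.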

References

Consoli, B., dos Santos, H. D. P., Ulbrich, A. H. D. P. S., Vieira, R., and Bordini, R. H. (2022). BRATECA (Brazilian tertiary care dataset): a clinical information dataset for the Portuguese language. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5609–5616, Marseille, France. European Language Resources Association.

Mutinda, F., Yada, S., Wakamiya, S., and Aramaki, E. (2021). Semantic textual similarity in Japanese clinical domain texts using BERT.

Real, L., F. E. G. O. H. (2021). The ASSIN 2 shared task: a quick overview. Methods Inf Med. https://doi.org/10.1007/978-3-030-41505-1_39

Schneider, E., Souza, J., Knafou, J., Copara, J., Oliveira, L., Gumiel, Y., Ferro Antunes de Oliveira, L., Teodoro, D., Paraiso, E., and Moro, C. (2020). BioBERTpt – a Portuguese neural language model for clinical named entity recognition. Pages 65–72. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.clinicalnlp-1.7

Shamout, F., Zhu, T., C. D. (2021). Machine learning for clinical outcome prediction. Volume 14, pages 116–126. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/RBME.2020.3007816