Semantic Textual Similarity for Abridging Clinical Notes in Brazilian Electronic Health Records


As information from electronic patient records becomes increasingly important for developing machine learning models, a holistic understanding of those records is also needed. In particular, clinical notes must be abridged so that important information is used in the training process without the repetition commonly found in such notes. This paper presents the pre-processing of clinical notes from the BRATECA dataset, a Brazilian tertiary care data collection. The pre-processing aims to remove repeated information that results from the interaction between healthcare providers and patients, based on semantic similarity values assigned between sentences in the clinical notes.
Keywords: Healthcare, Semantic Similarity, Electronic Patient Records, Abridging Information
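
The abridging idea described in the abstract can be illustrated with a minimal sketch: keep a sentence only if it is not too similar to any sentence already kept. This is not the authors' implementation. The function names, the 0.8 threshold, and the example sentences are hypothetical, and the token-overlap (Jaccard) score is only a stand-in for a learned semantic textual similarity model, which could be passed in through the `similarity` parameter.

```python
import re
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] based on token-set overlap.

    Assumption: a real pipeline would replace this with a learned
    semantic textual similarity model.
    """
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def abridge(
    sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep each sentence only if it scores below the threshold
    against every sentence kept so far (first occurrence wins)."""
    kept: List[str] = []
    for sentence in sentences:
        if all(similarity(sentence, k) < threshold for k in kept):
            kept.append(sentence)
    return kept


# Hypothetical clinical-note sentences; the second repeats the first.
notes = [
    "Paciente refere dor abdominal há dois dias.",
    "Paciente refere dor abdominal há dois dias!",
    "Prescrito dipirona 500 mg.",
]
abridged = abridge(notes)
```

Passing the scoring function as a parameter keeps the filtering logic independent of the similarity model, so the same loop works with any scorer that returns higher values for closer sentences.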


Consoli, B., dos Santos, H. D. P., Ulbrich, A. H. D. P. S., Vieira, R., and Bordini, R. H. (2022). BRATECA (Brazilian tertiary care dataset): a clinical information dataset for the Portuguese language. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5609–5616, Marseille, France. European Language Resources Association.

Mutinda, F., Yada, S., Wakamiya, S., and Aramaki, E. (2021). Semantic textual similarity in Japanese clinical domain texts using BERT.

Real, L., F. E. G. O. H. (2021). The ASSIN 2 shared task: A quick overview. Methods Inf Med.

Schneider, E., Souza, J., Knafou, J., Copara, J., Oliveira, L., Gumiel, Y., Ferro Antunes de Oliveira, L., Teodoro, D., Paraiso, E., and Moro, C. (2020). BioBERTpt – a Portuguese neural language model for clinical named entity recognition. pages 65–72. Association for Computational Linguistics.

Shamout, F., Zhu, T., and Clifton, D. A. (2021). Machine learning for clinical outcome prediction. volume 14, pages 116–126. Institute of Electrical and Electronics Engineers Inc.
BANDEIRA, Lucas T.; CONSOLI, Bernardo S.; VIEIRA, Renata; BORDINI, Rafael H. Semantic Textual Similarity for Abridging Clinical Notes in Brazilian Electronic Health Records. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 224-228. DOI: