Annotation of Clinical Narratives according to the Universal Dependencies guidelines

  • Adriana Pagano UFMG
  • Carlos A. S. Perini UFMG
  • Cláudia Benevenute IFES
  • Cristiano Colombo IFES

Abstract


This ongoing study applies Natural Language Processing to a corpus of Clinical Narratives in Brazilian Portuguese that exists in two annotated versions: one produced by a machine and the other by human annotators. The frequencies of part-of-speech (POS) tags and dependency relations in each version are calculated, and a corpus-driven analysis is performed, highlighting the corrections that the human annotators made to the machine annotations. Comparing the two versions supports the creation of treebanks that can be used to train new machine learning models and to improve Natural Language Processing applications that work with biomedical corpora. The comparison also makes it possible to analyze the theoretical consistency of the annotation, in order to uncover the grammatical system of this type of corpus and to create annotation guidelines for Clinical Narratives in Brazilian Portuguese according to Universal Dependencies.
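
As a rough illustration of the frequency comparison described above, the following Python sketch assumes that both versions are stored in CoNLL-U format (the standard Universal Dependencies file format) and share the same tokenization. The sample sentence, the constants `MACHINE` and `HUMAN`, and the helper functions are hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical toy data: one sentence ("Paciente nega febre") in CoNLL-U
# format, as tagged by a parser and as corrected by a human annotator.
MACHINE = (
    "1\tPaciente\tpaciente\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnega\tnegar\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tfebre\tfebre\tADJ\t_\t_\t2\tamod\t_\t_\n"
)
HUMAN = (
    "1\tPaciente\tpaciente\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnega\tnegar\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tfebre\tfebre\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)


def read_tokens(conllu):
    """Return (form, upos, deprel) for each word line of a CoNLL-U string."""
    tokens = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        # CoNLL-U columns: FORM is index 1, UPOS index 3, DEPREL index 7
        tokens.append((cols[1], cols[3], cols[7]))
    return tokens


def frequencies(tokens):
    """Count UPOS tags and dependency relations separately."""
    upos = Counter(upos for _, upos, _ in tokens)
    deprel = Counter(deprel for _, _, deprel in tokens)
    return upos, deprel


def corrections(machine, human):
    """List tokens whose UPOS or DEPREL differs between the two versions.

    Assumes both versions contain the same tokens in the same order.
    """
    return [
        (m[0], m[1:], h[1:])
        for m, h in zip(machine, human)
        if m[1:] != h[1:]
    ]


machine_tokens = read_tokens(MACHINE)
human_tokens = read_tokens(HUMAN)

machine_upos, machine_deprel = frequencies(machine_tokens)
human_upos, human_deprel = frequencies(human_tokens)

for form, before, after in corrections(machine_tokens, human_tokens):
    print(f"{form}: {before} -> {after}")
```

Pairing tokens by position keeps the sketch short. If the human annotators also changed the tokenization, the two versions would first need to be aligned.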

Published
2025-09-29
PAGANO, Adriana; PERINI, Carlos A. S.; BENEVENUTE, Cláudia; COLOMBO, Cristiano. Annotation of Clinical Narratives according to the Universal Dependencies guidelines. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 557-563. DOI: https://doi.org/10.5753/stil.2025.37857.