Noise in Brazilian Clinical Anamnesis: An Empirical Study

  • Leandro A. Carvalho UFC
  • Thiago Q. Oliveira IFCE
  • Flávio R. C. Sousa UFC
  • João B. F. Filho UFC

Abstract


Research Context: The lack of representative data can limit the development of robust clinical Natural Language Processing (NLP) models, because models trained on idealized data can perform poorly on noisy real-world Electronic Health Records (EHRs). Scientific and/or Practical Problem: A performance gap appears when these NLP models are deployed on noisy, real-world clinical text. The issue also affects less-resourced languages such as Brazilian Portuguese, where data scarcity can limit the development of effective clinical information systems. Proposed Solution and/or Analysis: This study addresses the challenge with a systematic approach to identifying and quantifying textual noise patterns in Brazilian Portuguese clinical narratives. Related IS Theory: Grounded in Task-Technology Fit (TTF) theory, the study investigates the misalignment between the task of reliable information extraction from noisy EHRs and the technology of NLP models, which may presuppose clean data. Research Method: A multi-stage methodology was employed to identify textual noise. A classification stage first flagged candidate tokens likely to be typos or abbreviations, and a lexicon-based validation then refined this selection so that only authentic noise instances were retained. Summary of Results: The analysis of a dataset of clinical anamneses revealed not only a high incidence of textual noise but also a consistent recurrence of specific noisy tokens across the dataset, indicating that data quality issues are widespread in this domain. Contributions and Impact to IS area: The study contributes a taxonomy of textual noise, complemented by two JSON files that structurally map the noisy tokens, establishing an empirical benchmark for Brazilian Portuguese clinical text and formalizing the data quality challenges that must be overcome for successful NLP implementation.
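The two-stage idea in the Research Method can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not specify the classifier, the lexicon, or the layout of the JSON files, so everything here is an assumption made for illustration. The stage-1 "classifier" is replaced by a deliberately over-eager rule (flag anything outside a small set of common words), and the word lists, the abbreviation table, the sample notes, and the report layout are all hypothetical.

```python
import json
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical toy vocabularies (assumptions, not the paper's resources).
COMMON_WORDS = {"com", "no", "há", "e", "dor", "dias", "dois"}
LEXICON = COMMON_WORDS | {"paciente", "peito", "nega", "febre", "refere", "hipertensão"}
ABBREVIATIONS = {"pcte": "paciente", "has": "hipertensão arterial sistêmica"}

# Runs of Unicode letters (no digits, no underscores).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def flag_candidates(tokens):
    """Stage 1 (stand-in for a classifier): over-flag anything uncommon."""
    return [tok for tok in tokens if tok not in COMMON_WORDS]


def validate(candidates):
    """Stage 2: lexicon-based validation drops candidates that are valid words."""
    return [tok for tok in candidates if tok not in LEXICON]


def build_report(notes):
    """Count validated noisy tokens and group them by noise type."""
    counts = Counter()
    for note in notes:
        counts.update(validate(flag_candidates(tokenize(note))))

    report = {"abbreviations": {}, "typos": {}, "unresolved": {}}
    vocabulary = sorted(LEXICON)  # sorted for deterministic matching
    for token, count in counts.items():
        if token in ABBREVIATIONS:
            report["abbreviations"][token] = {
                "expansion": ABBREVIATIONS[token],
                "count": count,
            }
            continue
        # Treat a token as a typo when it is very similar to a lexicon word.
        match = get_close_matches(token, vocabulary, n=1, cutoff=0.8)
        if match:
            report["typos"][token] = {"suggestion": match[0], "count": count}
        else:
            report["unresolved"][token] = {"count": count}
    return report


# Invented example notes, not taken from the study's dataset.
NOTES = [
    "Pcte com dor no peito há dois dias, nega febre.",
    "Pcte refere dor no peitto e HAS.",
]

report = build_report(NOTES)
print(json.dumps(report, ensure_ascii=False, indent=2))
```

In this toy run, stage 1 flags valid clinical words such as "peito" and "febre" alongside the real noise, and stage 2 removes them. This mirrors the refinement role the abstract assigns to the lexicon-based validation. The per-token counts are one simple way to expose the recurrence of specific noisy tokens across notes.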

Published
25/05/2026
CARVALHO, Leandro A.; OLIVEIRA, Thiago Q.; SOUSA, Flávio R. C.; F. FILHO, João B. Noise in Brazilian Clinical Anamnesis: An Empirical Study. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 401-418. DOI: https://doi.org/10.5753/sbsi.2026.248362.
