Automated classification of cardiology diagnoses based on textual medical reports


  • J. A. O. Pedrosa, Universidade Federal de Minas Gerais
  • D. M. Oliveira, Universidade Federal de Minas Gerais
  • Wagner Meira Jr., Universidade Federal de Minas Gerais
  • Antonio Luiz P. Ribeiro, Universidade Federal de Minas Gerais



Keywords: cardiology, information extraction, machine learning, natural language processing


Automatic classification of diagnoses has been a long-standing challenge for Computer Science and related disciplines. Textual clinical reports are a rich source of data for this task, but building classification models from them is not trivial. This work addresses the identification of the medical diagnoses indicated in such reports. Several methods have been proposed for this problem, but a method designed for cardiology reports written in Portuguese is still needed. In this paper we describe a method that handles the peculiarities of clinical reports, including their medical terminology, and that estimates the diagnosis from raw clinical reports and a list of possible diagnoses. Experimental results show that our method achieves a high degree of accuracy, even for infrequent classes and complex databases.








How to Cite

Pedrosa, J. A. O., Oliveira, D. M., Meira Jr., W., & Ribeiro, A. L. P. (2021). Automated classification of cardiology diagnoses based on textual medical reports. Journal of Information and Data Management, 12(1).



KDMILe 2020