A hierarchical model for automatic Neoplasm ICD coding

  • Miguel Díaz Iturry (USP)
  • Solange N. Alves-Souza (USP)
  • Marcia Ito (CEETEPS)
  • Suzana Alves da Silva (HCor Associação Beneficiente Siria)


International Classification of Diseases (ICD) codes are used for a range of management activities in hospitals. Previous research has employed Machine Learning (ML) models for automatic coding to simplify the code assignment process; however, model performance was compromised by label imbalance and the large number of labels. In this work, a Support Vector Machine (SVM) model for Neoplasm ICD coding was trained on a dataset preprocessed with re-sampling methods, which mitigate label imbalance and increase model sensitivity. To address the large number of labels, body location information contained in the medical records and in the ICD code descriptions was used to build a hierarchical model, which improved the performance of a non-hierarchical baseline model by up to 15%.
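
The two ideas in the abstract, re-sampling against label imbalance and a two-level hierarchy keyed on body location, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which re-sampling methods were used, so plain random over-sampling is shown; a simple token-overlap scorer stands in for the SVM to keep the example dependency-free; and the names `oversample`, `TokenOverlapClassifier`, `HierarchicalCoder`, `RECORDS`, and `LOCATION_OF`, as well as the toy notes and the code-to-location mapping, are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample(records, seed=0):
    """Random over-sampling (assumed method): duplicate records of rarer
    labels until every label is as frequent as the most common one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


class TokenOverlapClassifier:
    """Stand-in for the SVM: scores a label by how often the query tokens
    occurred in that label's training text, normalized by label size."""

    def fit(self, records):
        self.counts = defaultdict(Counter)
        for text, label in records:
            self.counts[label].update(text.lower().split())
        return self

    def predict(self, text):
        tokens = text.lower().split()

        def score(label):
            total = sum(self.counts[label].values())
            return sum(self.counts[label][t] for t in tokens) / total

        # sorted() makes tie-breaking deterministic.
        return max(sorted(self.counts), key=score)


class HierarchicalCoder:
    """Level 1 predicts the body location; level 2 picks a code using a
    classifier trained only on the codes of that location."""

    def fit(self, records, location_of):
        level1 = [(text, location_of[code]) for text, code in records]
        self.locator = TokenOverlapClassifier().fit(oversample(level1))
        groups = defaultdict(list)
        for text, code in records:
            groups[location_of[code]].append((text, code))
        self.coders = {
            loc: TokenOverlapClassifier().fit(oversample(group))
            for loc, group in groups.items()
        }
        return self

    def predict(self, text):
        location = self.locator.predict(text)
        return self.coders[location].predict(text)


# Toy notes and code-to-location mapping, invented for illustration only.
RECORDS = [
    ("invasive ductal carcinoma breast", "C50"),
    ("malignant tumor breast tissue", "C50"),
    ("invasive carcinoma left breast", "C50"),
    ("carcinoma in situ breast duct", "D05"),
    ("malignant nodule right lung", "C34"),
    ("invasive tumor lung bronchus", "C34"),
    ("carcinoma in situ bronchus", "D02"),
]
LOCATION_OF = {"C50": "breast", "D05": "breast", "C34": "lung", "D02": "lung"}

model = HierarchicalCoder().fit(RECORDS, LOCATION_OF)
print(model.predict("ductal carcinoma in situ breast"))  # D05
print(model.predict("malignant tumor lung"))  # C34
```

Because each second-level classifier only has to separate the codes of one body location, it faces far fewer labels than a flat model over all codes. In practice the stand-in scorer would be replaced by a trained text classifier such as an SVM over TF-IDF features.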


ITURRY, Miguel Díaz; ALVES-SOUZA, Solange N.; ITO, Marcia; SILVA, Suzana Alves da. A hierarchical model for automatic Neoplasm ICD coding. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 22., 2022, Teresina. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 381-390. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2022.222705.
