Exploratory Analysis of Electronic Health Records using Topic Modeling


  • Ivair Puerari, Federal University of Fronteira Sul
  • Denio Duarte, Federal University of Fronteira Sul
  • Guilherme Dal Bianco, Federal University of Fronteira Sul
  • Julyane Felipette Lima, Federal University of Fronteira Sul




Topic Modeling, Electronic Health Record, ICU, LDA, Discharge, Death


The rapid growth of electronic health record (EHR) systems has increased the amount of information available about hospital patients. This large volume of text offers an opportunity to extract previously unknown information about medical history, medication, diseases, and allergies, among other things. Extracting the main topics covered by a text collection can yield valuable insights, and topic modeling approaches have been used for this kind of information discovery and thematic topic extraction. In this context, this work presents an exploratory analysis of a collection of electronic health records from an intensive care unit (ICU). The collection is split into two sub-collections: patients who were discharged and patients who died. We apply an LDA-based approach to discover the latent topics in the two sub-collections. The analyses show that some topics, such as renal diseases, recur more often in the death collection, whereas others, such as diabetes, recur more often in the discharge collection. These results may help improve intensive care services, since the topics can serve as a guide to understanding the patterns behind discharge and death outcomes.
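The split-then-model workflow described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it uses a toy collapsed Gibbs sampler for LDA written with the Python standard library only, and the notes, the function name `fit_lda`, and all hyperparameter values are hypothetical. It also assumes one model is fitted per sub-collection, which the abstract suggests but does not spell out.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0, top_n=3):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of token lists. Returns (top_words, doc_topic), where
    top_words[k] holds the top_n most frequent words assigned to topic k and
    doc_topic[d][k] counts the tokens of document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    top_words = [
        [vocab[v] for v in sorted(range(vocab_size), key=lambda v: -topic_word[k][v])[:top_n]]
        for k in range(num_topics)
    ]
    return top_words, doc_topic


# Hypothetical, invented notes (already tokenized); NOT data from the study.
discharge_notes = [
    ["diabetes", "insulin", "glucose", "stable"],
    ["insulin", "glucose", "diet", "diabetes"],
    ["fracture", "surgery", "pain", "stable"],
    ["surgery", "pain", "fracture", "recovery"],
]
death_notes = [
    ["renal", "failure", "dialysis", "creatinine"],
    ["dialysis", "renal", "creatinine", "sepsis"],
    ["sepsis", "shock", "ventilation", "failure"],
    ["ventilation", "shock", "sepsis", "sedation"],
]

# One model per sub-collection (an assumption of this sketch).
discharge_topics, discharge_mix = fit_lda(discharge_notes, num_topics=2)
death_topics, death_mix = fit_lda(death_notes, num_topics=2)

for name, topics in (("discharge", discharge_topics), ("death", death_topics)):
    for k, words in enumerate(topics):
        print(f"{name} topic {k}: {', '.join(words)}")
```

Comparing the top words of each sub-collection side by side is the kind of inspection the abstract refers to when it contrasts topics such as renal diseases and diabetes. For a real corpus, an established library with a proper preprocessing pipeline and a principled choice of the number of topics would be a more reasonable starting point than this toy sampler.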








How to Cite

Puerari, I., Duarte, D., Dal Bianco, G., & Felipette Lima, J. (2021). Exploratory Analysis of Electronic Health Records using Topic Modeling. Journal of Information and Data Management, 11(2). https://doi.org/10.5753/jidm.2020.2024



Regular Papers