Similarity Search and Correlation-Based Exploratory Analysis in EHRs: A Case Study with COVID-19 Databases


During the COVID-19 pandemic, many hospitals collected Electronic Health Records (EHRs) from patients and shared them publicly. EHRs include heterogeneous attribute types, such as image exams and numerical, textual, and categorical information. Simply posing similarity queries over EHRs can underestimate the semantics and potential information of particular attributes, so such queries are best supported by exploratory data analysis methods. We therefore propose the Sketch method for comparing EHRs by similarity, providing a tool for correlation-based exploratory analysis over different attributes. Sketch computes the overall data correlation considering the distance space of every attribute. It further employs both ANOVA and association rules with lift correlations to study the relationships between variables, allowing an in-depth data analysis. As a case study, we employed two open databases of COVID-19 cases, showing that specialists can benefit from the inference modules of Sketch to analyze EHRs. Sketch found strong correlations among tuples and attributes, with statistically significant results. The exploratory analysis was shown to complement the similarity search task by identifying and evaluating patterns discovered from heterogeneous attributes.
Keywords: Exploratory data analysis, correlation, electronic health records, COVID-19
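
Two of the ideas named in the abstract, correlating the distance spaces of different attributes and scoring association rules by lift, can be illustrated with a minimal sketch. This is not the Sketch implementation: the toy records, the attribute names (`age`, `crp`, `outcome`), the absolute-difference distance, and the choice of Pearson correlation are all assumptions made for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lift(records, antecedent, consequent):
    """Lift of the rule antecedent -> consequent: P(A and B) / (P(A) * P(B))."""
    n = len(records)
    p_a = sum(antecedent(r) for r in records) / n
    p_b = sum(consequent(r) for r in records) / n
    p_ab = sum(antecedent(r) and consequent(r) for r in records) / n
    return p_ab / (p_a * p_b)


# Hypothetical toy records; real EHRs are far richer and noisier.
records = [
    {"age": 30, "crp": 5.0, "outcome": "discharged"},
    {"age": 45, "crp": 20.0, "outcome": "discharged"},
    {"age": 60, "crp": 80.0, "outcome": "icu"},
    {"age": 75, "crp": 120.0, "outcome": "icu"},
]

# Distance of every record to a query record, computed per attribute
# (assumed distance: absolute difference).
query = records[0]
age_dist = [abs(r["age"] - query["age"]) for r in records]
crp_dist = [abs(r["crp"] - query["crp"]) for r in records]

# Correlation between the two per-attribute distance vectors.
distance_correlation = pearson(age_dist, crp_dist)

# Lift of the hypothetical rule "age >= 60 -> outcome is icu".
rule_lift = lift(
    records,
    lambda r: r["age"] >= 60,
    lambda r: r["outcome"] == "icu",
)

print(f"distance correlation: {distance_correlation:.3f}")
print(f"rule lift: {rule_lift:.2f}")
```

A lift above 1 suggests that the antecedent and consequent co-occur more often than they would if independent, while a correlation near 1 between distance vectors suggests that records close to the query in one attribute also tend to be close in the other.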


CAZZOLATO, Mirela T.; RODRIGUES, Lucas S.; RIBEIRO, Marcela X.; GUTIERREZ, Marco A.; TRAINA JR., Caetano; TRAINA, Agma J. M. Similarity Search and Correlation-Based Exploratory Analysis in EHRs: A Case Study with COVID-19 Databases. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 36., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 25-36. ISSN 2763-8979. DOI: