Sketch+ for Visual and Correlation-Based Exploratory Data Analysis: A Case Study with COVID-19 Databases

Authors

  • Mirela T. Cazzolato, University of São Paulo
  • Lucas S. Rodrigues, University of São Paulo
  • Marcela X. Ribeiro, Federal University of São Carlos
  • Marco A. Gutierrez, University of São Paulo
  • Caetano Traina Jr., University of São Paulo
  • Agma J. M. Traina, University of São Paulo

DOI:

https://doi.org/10.5753/jidm.2022.2484

Keywords:

CBIR, correlation, COVID-19, exploratory data analysis, visualization

Abstract

The amount of data generated daily by different sources grows exponentially and brings new challenges to information technology experts. The recorded data usually include heterogeneous attribute types, such as traditional date, numerical, textual, and categorical information, as well as complex ones, such as images, videos, and multidimensional data. Simply posing similarity queries over such records can underestimate the semantics and potential usefulness of particular attributes. In this context, Exploratory Data Analysis (EDA) is well suited to understanding data, extracting knowledge, and visualizing existing patterns. In this paper, we propose Sketch+, a technique and a corresponding supporting tool to compare electronic health records (provided by hospitals) by similarity, supporting correlation-based exploratory analysis over attributes of different types and allowing data preprocessing tasks for visualization and knowledge extraction. Sketch+ computes partial and overall data correlation considering the distance spaces induced by the attributes. It employs both ANOVA and association rules with lift correlations to study relationships between variables, allowing extensive data analysis. Among the tools provided, a pixel-oriented one helps analysts observe visual correlations among date, categorical, and numerical attributes. As a running case study, we employed three open databases of COVID-19 cases, showing that specialists can benefit from the inference modules of Sketch+ to analyze electronic records. The study highlights how Sketch+ can be employed to spot strong correlations among tuples and attributes, with statistically significant results. The exploratory analysis proved to be an essential complement to similarity search tasks, identifying and evaluating patterns from heterogeneous attributes.
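
The two statistical measures named in the abstract, lift for association rules and the one-way ANOVA F statistic, can be illustrated with a minimal sketch. This is not the Sketch+ implementation: the function names `lift` and `anova_f`, the attribute values, and the toy records below are all hypothetical and chosen only to show what each measure computes.

```python
def lift(records, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item sets.

    lift = support(A and C) / (support(A) * support(C)).
    Values above 1 suggest a positive association between A and C.
    """
    n = len(records)
    count_a = sum(1 for r in records if antecedent <= r)
    count_c = sum(1 for r in records if consequent <= r)
    count_both = sum(1 for r in records if (antecedent | consequent) <= r)
    if count_a == 0 or count_c == 0:
        return 0.0  # rule is undefined when either side never occurs
    return (count_both / n) / ((count_a / n) * (count_c / n))


def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical categorical records (one set of attribute values per patient).
records = [
    {"fever", "icu"},
    {"fever", "icu"},
    {"cough"},
    {"cough"},
]

# Hypothetical numeric attribute split by a categorical one (three groups).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

print("lift(fever -> icu):", lift(records, {"fever"}, {"icu"}))
print("ANOVA F:", anova_f(groups))
```

In this toy data the rule fever -> icu has a lift of 2.0, meaning the two values co-occur twice as often as independence would predict, and the F statistic is approximately 13, reflecting group means that differ much more than the spread within each group. Judging statistical significance would additionally require comparing F against the F distribution with (k - 1, n - k) degrees of freedom, which is omitted here.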

Published

2022-09-12

How to Cite

Cazzolato, M. T., Rodrigues, L. S., Ribeiro, M. X., Gutierrez, M. A., Traina Jr., C., & Traina, A. J. M. (2022). Sketch+ for Visual and Correlation-Based Exploratory Data Analysis: A Case Study with COVID-19 Databases. Journal of Information and Data Management, 13(2). https://doi.org/10.5753/jidm.2022.2484

Section

SBBD 2021 Full papers - Extended Papers