Xi-DL: A Data Lake Management System for Healthcare Data Monitoring

  • Lucas Tito, Federal Fluminense University
  • Cristina Motinha, Federal Fluminense University
  • Filipe Santiago, Federal Fluminense University
  • Kary Ocaña, National Laboratory for Scientific Computing
  • Marcos Bedo, Federal Fluminense University
  • Daniel de Oliveira, Federal Fluminense University

Abstract


Scientific domains continuously produce exponentially growing volumes of heterogeneous data (both structured and unstructured) that do not always fit into inflexible Data Warehouse solutions. Data Lakes, on the other hand, are a suitable technology for handling such data, as they require no prior modeling (data are stored raw) and provide in-situ querying mechanisms. While packaged Hadoop-based solutions for Data Lakes do exist, they impose an additional burden on scientists, since non-negligible computational expertise is required to operate them. This study introduces ξ-DL, a lightweight Data Lake management system designed for general scientific domains, and evaluates its viability on COVID-19 data collected in Brazil. An initial assessment with domain experts indicated the potential of ξ-DL for scientific data handling.

Keywords: Data Lake Management, Data Lakes, COVID-19

Published
2020-09-28
TITO, Lucas; MOTINHA, Cristina; SANTIAGO, Filipe; OCAÑA, Kary; BEDO, Marcos; DE OLIVEIRA, Daniel. Xi-DL: A Data Lake Management System for Healthcare Data Monitoring. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 35., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 151-156. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2020.13633.