Proposing a data lake for health research from interoperable multicentric data pools
Abstract
With the growing demand for data science, organizing and preparing databases have become critical activities that consume more than 80% of project effort. In the medical domain, many hospitals already use a myriad of technologies and information systems for medical records and images, but they do not always adopt standards for uniform, interoperable data, and they seldom adopt analytics-oriented tools such as data lakes and data warehouses. In this article we propose the data pool, an intermediate data model that eases the organization of data lakes for health research. The data pool was implemented and adopted in real medical research, supporting computational learning workflows.
Keywords:
data science, clinical research, data lake
References
Benson, T. (2012). Principles of health interoperability HL7 and SNOMED. Springer Science & Business Media.
Cabral, E. F. and Cordeiro, R. L. (2020). Fast and scalable outlier detection with sorted hypercubes. In Proc. 29th ACM CIKM, pages 95–104.
Canêo, P. K. and Rondina, J. M. (2014). Prontuário eletrônico do paciente: conhecendo as experiências de sua implantação. JHI, 6(2).
de Amo, S. (2004). Técnicas de mineração de dados. JAI.
de Azevedo-Marques, P. M. and Salomão, S. C. (2009). Pacs: sistemas de arquivamento e distribuição de imagens. Rev. bras. fis. med., 3(1):131–139.
DiCenso, A., Bayley, L., and Haynes, R. B. (2009). Accessing pre-appraised evidence: fine-tuning the 5s model into a 6s model. Evidence-Based Nursing, 12(4):99–101.
FAPESP (2020). FAPESP COVID-19 Data Sharing/BR. https://repositoriodatasharingfapesp.uspdigital.usp.br/.
Furuie, S. S., Gutierrez, M. A., Figueiredo, J., Tachinardi, U., Rebelo, M., Bertozzo, N., Moreno, R., Motta, G., Nardon, F., and Oliveira, P. (2003). Prontuário eletrônico de pacientes: integrando informações clínicas e imagens médicas. Rev. bras. eng. biomed, pages 125–137.
Kang, B., Yoon, J., Kim, H. Y., Jo, S. J., Lee, Y., and Kam, H. J. (2021). Deep-learning-based automated terminology mapping in omop-cdm. JAMIA. [ocab030].
Larson, P.-Å., Clinciu, C., Hanson, E. N., Oks, A., Price, S. L., Rangarajan, S., Surna, A., and Zhou, Q. (2011). Sql server column store indexes. In Proc. ACM SIGMOD Conf. MOD, pages 1177–1184.
Mildenberger, P., Eichelberg, M., and Martin, E. (2002). Introduction to the dicom standard. European radiology, 12(4):920–927.
Miller, R. J. (2018). Open data integration. Proc. VLDB Endow., 11(12):2130–2139.
Rodrigues, L. S., Cazzolato, M. T., Traina, A. J. M., and Traina, C. (2020). Taking advantage of highly-correlated attributes in similarity queries with missing values. In Lecture Notes in Computer Science, volume 12440, pages 168–176. Springer.
Segaran, T. and Hammerbacher, J. (2009). Beautiful data: the stories behind elegant data solutions. O’Reilly Media, Inc.
Tito, L., Motinha, C., Santiago, F., Ocaña, K., Bedo, M., and de Oliveira, D. (2020). Xi-dl: um sistema de gerência de data lake para monitoramento de dados da saúde. In Anais do XXXV SBBD, pages 151–156, Porto Alegre, RS, Brasil. SBC.
Traina Jr, C., Moriyama, A., Rocha, G., Cordeiro, R., Ciferri, C. D., and Traina, A. (2019). The similarql framework: similarity queries in plain sql. In Proc. 34th ACM/SIGAPP SAC, pages 468–471.
Voss, E. A., Makadia, R., Matcho, A., Ma, Q., Knoll, C., Schuemie, M., DeFalco, F. J., Londhe, A., Zhu, V., and Ryan, P. B. (2015). Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. JAMIA, 22(3):553–564.
Published
2021-10-04
How to Cite
LIMA, Daniel M.; MORENO, Ramon A.; PIRES, Fabio A.; GUTIERREZ, Marco A. Proposing a data lake for health research from interoperable multicentric data pools. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 36., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 367-372. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2021.17900.
