Management of Distributed Provenance Data from Scientific Experiments: A Systematic Mapping
Abstract
Simulation-based scientific experiments (known as in silico experiments) depend heavily on computational resources. Many experiments consist of hundreds or thousands of program invocations. These experiments commonly benefit from high-performance computing (HPC) environments, such as clusters and clouds, to speed up their execution. However, even in HPC environments, the volume of data produced and consumed (experiment execution data and provenance data) that must be managed can become a bottleneck. If managed in a centralized way, these data can hinder the analysis and validation of results as well as the performance of the experiment's execution itself. An alternative is to store and query these data in a distributed manner, which introduces additional challenges. Although approaches for managing distributed provenance data exist, there is no de facto standard, which makes it very difficult to correlate, classify, and compare the existing approaches. Over the years, systematic mappings and taxonomies have been used to create models that support surveying and classifying approaches within a domain. The main goal of this paper is to apply a systematic mapping to the area of distributed provenance data management and to propose a taxonomy for this domain, classifying the existing approaches according to the taxonomy's classes.