Integrated Analysis of Heterogeneous Provenance Graphs Using a PolyStore Approach

  • Yan Mendes UFJF
  • Victor Ströele UFJF
  • Daniel de Oliveira UFF
  • Kary Ocaña LNCC

Abstract


Workflow provenance data are captured by several existing Workflow Management Systems (WfMSs). Distinct WfMSs use different storage formats to represent data and usually capture and store data at different granularities, in a graph-like shape. This allows researchers to analyze and validate their workflows’ results. Yet a challenge emerges in more complex scenarios, where scientists need to compare provenance data originating from different WfMSs and workflows. To solve this problem, we propose PolyFlow, an approach based on Polystore systems that integrates multiple heterogeneous provenance databases by adopting an on-demand global schema (ProvONE), i.e., it transforms the data at execution time, allowing researchers to query multiple provenance graphs via , exploring and linking provenance of different workflows. To assess PolyFlow’s viability, we developed conceptual to two WfMSs (Swift/T and Kepler) using a real experiment to analyze phylogenetic data.
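
The idea of an on-demand global schema can be illustrated with a minimal sketch. This is not PolyFlow's implementation: the native record layouts, the mapper functions, the `Mediator` class, and the tool and file names are all hypothetical, and the `Execution` type is only a simplified stand-in for a ProvONE execution with the data it used and generated. The sketch assumes each source keeps its native format and is mapped to the shared representation only when a query runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Execution:
    """Simplified stand-in for a ProvONE execution (hypothetical shape)."""
    source: str                 # which WfMS the record came from
    program: str                # the program/actor that ran
    used: Tuple[str, ...]       # data items consumed
    generated: Tuple[str, ...]  # data items produced


# Hypothetical native records; real Swift/T and Kepler stores look different.
SWIFT_ROWS: List[Dict] = [
    {"app": "mafft", "inputs": "seqs.fasta", "outputs": "aligned.fasta"},
]
KEPLER_ROWS: List[Dict] = [
    {"actor": "RAxML", "reads": ["aligned.fasta"], "writes": ["tree.nwk"]},
]


def map_swift(rows: Iterable[Dict]) -> Iterator[Execution]:
    """Map the assumed Swift/T-style rows to the global representation."""
    for row in rows:
        yield Execution(
            "swift",
            row["app"],
            tuple(row["inputs"].split(",")),
            tuple(row["outputs"].split(",")),
        )


def map_kepler(rows: Iterable[Dict]) -> Iterator[Execution]:
    """Map the assumed Kepler-style rows to the global representation."""
    for row in rows:
        yield Execution(
            "kepler", row["actor"], tuple(row["reads"]), tuple(row["writes"])
        )


Mapper = Callable[[Iterable[Dict]], Iterator[Execution]]


class Mediator:
    """Keeps sources in native form and maps them only at query time."""

    def __init__(self) -> None:
        self._sources: List[Tuple[Iterable[Dict], Mapper]] = []

    def register(self, rows: Iterable[Dict], mapper: Mapper) -> None:
        self._sources.append((rows, mapper))

    def executions(self) -> Iterator[Execution]:
        # The transformation happens here, lazily, on each query.
        for rows, mapper in self._sources:
            yield from mapper(rows)

    def cross_links(self) -> List[Tuple[str, str, str]]:
        """Find data generated in one source and used in another."""
        execs = list(self.executions())
        return [
            (producer.program, item, consumer.program)
            for producer in execs
            for consumer in execs
            if producer.source != consumer.source
            for item in producer.generated
            if item in consumer.used
        ]


mediator = Mediator()
mediator.register(SWIFT_ROWS, map_swift)
mediator.register(KEPLER_ROWS, map_kepler)
print(mediator.cross_links())
```

Because nothing is materialized ahead of time, adding a further WfMS in this sketch only requires registering one more mapper; the trade-off is that every query pays the mapping cost again.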

Keywords: Workflows’ provenance, PolyStore Approach, on-demand global schema, multiple provenance graphs, workflow provenance

Published
2019-10-07
MENDES, Yan; STRÖELE, Victor; DE OLIVEIRA, Daniel; OCAÑA, Kary. Integrated Analysis of Heterogeneous Provenance Graphs Using a PolyStore Approach. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 34., 2019, Fortaleza. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 73-84. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2019.8809.