Dataflow Analysis of Serverless Scientific Applications using Provenance Data
Abstract
This paper presents an approach to ease dataflow analysis in Computational Science and Engineering (CSE) applications that invoke serverless functions. By capturing provenance data while executing CSE applications within a serverless environment, the approach organizes this information into an integrated database that users can query at runtime. Since serverless platforms typically lack native support for provenance tracking, the proposed solution helps users understand and analyze the behavior and outcomes of their CSE applications. We detail the main features of the approach as implemented in the DENETHOR tool and evaluate its effectiveness with a real-world CSE application. The results demonstrate that DENETHOR enhances analytical capabilities via its provenance database.
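The capture-then-query idea described in the abstract can be illustrated with a minimal sketch. This is not DENETHOR's implementation or API: the `task` table, the `traced` decorator, and the `align_sequences` handler are hypothetical names chosen for illustration, and SQLite stands in for whatever provenance database the tool actually uses. The sketch records each function invocation before and after it runs, so that a query issued while the application is still executing can already see in-flight tasks.

```python
import json
import sqlite3
import time
import uuid
from functools import wraps


def open_store(path=":memory:"):
    """Open a (hypothetical) provenance store with one table of task records."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task ("
        "id TEXT PRIMARY KEY, function TEXT, inputs TEXT, outputs TEXT, "
        "started REAL, ended REAL)"
    )
    return conn


def traced(conn):
    """Wrap a serverless-style handler so each call is recorded in the store."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event):
            task_id = str(uuid.uuid4())
            # Record the inputs first, so the task is visible while it runs.
            conn.execute(
                "INSERT INTO task VALUES (?, ?, ?, NULL, ?, NULL)",
                (task_id, handler.__name__, json.dumps(event), time.time()),
            )
            conn.commit()
            result = handler(event)
            # Complete the record with the outputs and the end timestamp.
            conn.execute(
                "UPDATE task SET outputs = ?, ended = ? WHERE id = ?",
                (json.dumps(result), time.time(), task_id),
            )
            conn.commit()
            return result
        return wrapper
    return decorator


conn = open_store()


@traced(conn)
def align_sequences(event):
    # Stand-in for a real CSE function invoked on a serverless platform.
    return {"length": len(event["sequence"])}


align_sequences({"sequence": "ACGT"})
align_sequences({"sequence": "ACGTACGT"})

# A dataflow query: which inputs produced which outputs, in invocation order.
rows = conn.execute(
    "SELECT function, inputs, outputs FROM task ORDER BY rowid"
).fetchall()
for function, inputs, outputs in rows:
    print(function, inputs, outputs)
```

In a real deployment the store would have to be reachable from every function instance, since serverless invocations do not share memory; the in-memory database here only keeps the sketch self-contained.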
Keywords:
Dataflow, Provenance, Serverless
References
Amir, A. and Keselman, D. (1997). Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM J. on Comp., 26(6):1656–1669.
Barham, P. et al. (2004). Using magpie for request extraction and workload modelling. In OSDI’04, San Francisco, CA. USENIX Association.
Ben-Shimol, L. et al. (2025). Detection of compromised functions in a serverless cloud environment. Computers & Security, 150:104261.
Bux, M. et al. (2015). SAASFEE: scalable scientific workflow execution engine. Proc. VLDB Endow., 8(12):1892–1895.
Cantrill, B. M., Shapiro, M. W., and Leventhal, A. H. (2004). Dynamic instrumentation of production systems. In USENIX ATC’04, Boston, MA. USENIX.
Datta, P., Polinsky, I., Inam, M. A., Bates, A., and Enck, W. (2022). Alastor: Reconstructing the provenance of serverless intrusions. In USENIX Security Symposium.
de Oliveira, D., Liu, J., and Pacitti, E. (2019). Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments. Synthesis Lectures on Data Management. Morgan & Claypool Publishers.
Elshamy, A., Alquraan, A., and Al-Kiswany, S. (2023). A study of orchestration approaches for scientific workflows in serverless computing. SESAME ’23, page 34–40, New York, NY, USA. ACM.
Freire, J., Koop, D., Santos, E., and Silva, C. T. (2008). Provenance for computational tasks: A survey. Computing in Science & Engineering, 10(3):11–21.
Goloboff, P. A. et al. (2009). Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics, 25(3):211–230.
Guerra, G. et al. (2012). Uncertainty quantification in computational predictive models for fluid dynamics using a workflow management engine. Int. J. for Uncert. Quant., 2(1):53–71.
Hautz, M., Ristov, S., and Felderer, M. (2023). Characterizing afcl serverless scientific workflows in federated faas. WoSC ’23, page 24–29, NY, USA. ACM.
Hellerstein, J. M. et al. (2019). Serverless computing: One step forward, two steps back. In CIDR.
Herschel, M., Diestelkämper, R., and Ben Lahmar, H. (2017). A survey on provenance: What for? what form? what from? The VLDB Journal, 26.
Huang, J. et al. (2024). Faasrca: Full lifecycle root cause analysis for serverless applications. In ISSRE’24, pages 415–426. IEEE.
Kamble, S., Jin, X., Niu, N., and Simon, M. (2017). A novel coupling pattern in computational science and engineering software. In Proceedings of the 12th International Workshop on Software Engineering for Science, SE4Science ’17, page 9–12. IEEE Press.
Khochare, A., Simmhan, Y., Mehta, S., and Agarwal, A. (2022). Toward scientific workflows in a serverless world. In 2022 IEEE e-Science, pages 399–400.
Kiar, G. et al. (2019). A serverless tool for platform agnostic computational experiment management. Frontiers in Neuroinformatics, 13.
Mattoso, M., Werner, C., Travassos, G. H., Braganholo, V., Ogasawara, E., Oliveira, D., Cruz, S., Martinho, W., and Murta, L. (2010). Towards supporting the life cycle of large scale scientific experiments. International Journal of Business Process Integration and Management, 5(1):79.
Moreau, L. et al. (2008). Special issue: The first provenance challenge. Concurrency and Computation: Practice and Experience, 20(5):409–418.
Moreau, L. and Groth, P. (2013). Provenance: an introduction to prov. Synthesis Lectures on the Semantic Web: Theory and Technology, 3(4):1–129.
Neves, V. C., de Oliveira, D., Ocaña, K. A. C. S., Braganholo, V., and Murta, L. (2017). Managing provenance of implicit data flows in scientific experiments. ACM Trans. Internet Techn., 17(4):36:1–36:22.
Ocaña, K. and de Oliveira, D. (2015). Parallel computing in genomic research: advances and applications. Adv. Appl. Bioinform. Chem., 8:23–35.
Pimentel, J. F. et al. (2017). noworkflow: a tool for collecting, analyzing, and managing provenance from python scripts. VLDB, 10(12).
Pina, D., Kunstmann, L., Chapman, A., de Oliveira, D., and Mattoso, M. (2025). DLProv: a suite of provenance services for deep learning workflow analyses. PeerJ Comput. Sci., 11(e2985):e2985.
Puigbò, P. et al. (2019). Genome-wide comparative analysis of phylogenetic trees: The prokaryotic forest of life. In Evolutionary Genomics: Statistical and Computational Methods, pages 241–269. Springer New York, New York, NY.
Rude, U., Willcox, K., McInnes, L. C., and Sterck, H. D. (2018). Research and education in computational science and engineering. SIAM Review, 60(3):707–754.
Satapathy, U., Thakur, R., Chattopadhay, S., and Chakraborty, S. (2023). Disprotrack: Distributed provenance tracking over serverless applications. In INFOCOM 2023, pages 1–10.
Silva, V., de Oliveira, D., Valduriez, P., and Mattoso, M. (2018). Dfanalyzer: Runtime dataflow analysis of scientific applications using provenance. Proceedings of the VLDB Endowment.
Skluzacek, T. J. et al. (2019). Serverless workflows for indexing large scientific data. In Proceedings of the 5th International Workshop on Serverless Computing, pages 43–48.
Wen, J., Chen, Z., Zhao, J., Sarro, F., Ping, H., Zhang, Y., Wang, S., and Liu, X. (2025). Scope: Performance testing for serverless computing. ACM Transactions on Software Engineering and Methodology.
Wen, J. et al. (2021). An empirical study on challenges of application development in serverless computing. In Proc. of the ESEC/FSE 2021, pages 416–428.
Published
2025-09-29
How to Cite
RIBEIRO, Marcello W. M.; DE PAULA, Ubiratam; KUNSTMANN, Liliane; FROTA, Yuri; ROSSETI, Isabel; DE OLIVEIRA, Daniel. Dataflow Analysis of Serverless Scientific Applications using Provenance Data. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 56-69. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247000.
