Twinscie-Prov: Managing Provenance in the ML Lifecycle with Twinscie

  • Júlia Neumann Bastos, National Laboratory for Scientific Computing (LNCC), http://orcid.org/0009-0000-5477-3605
  • Fabio Porto, National Laboratory for Scientific Computing (LNCC)
  • Fábio Levy Siqueira, University of São Paulo (USP)
  • Edson Gomi, University of São Paulo (USP)
  • Ismael Santos, Petrobras
  • Rodrigo Barreira, Petrobras
  • Isabela Siqueira, Petrobras
  • Eduardo Ogasawara, Federal Center for Technological Education Celso Suckow da Fonseca (CEFET/RJ)

Abstract


The increasing complexity of machine learning (ML) applications calls for systems that ensure traceability and reproducibility. This work introduces Twinscie-Prov, an adaptation of the W3C PROV standard that structures data and process provenance throughout the ML lifecycle within the Twinscie system. Provenance data is stored in the Neo4j graph database, which supports complex queries and auditing. Preliminary studies indicate that, for queries that navigate dependency graphs, which are typical of provenance data, the Neo4j implementation is up to five orders of magnitude faster than log-based approaches.
Keywords: Data Provenance, Machine Learning, Model Lifecycle, W3C PROV, Graph Database
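
To make the idea of navigating a provenance dependency graph concrete, the following is a minimal in-memory sketch in Python. It is not the Twinscie-Prov schema or its Neo4j implementation. The relation names (used, wasGeneratedBy, wasDerivedFrom) come from W3C PROV, while the node identifiers, the EDGES list, and the build_graph and lineage helpers are hypothetical names introduced only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical PROV-style records as (subject, relation, object) triples.
# Relation names follow W3C PROV; the identifiers are invented for this sketch.
EDGES = [
    ("train_run_1", "used", "dataset_v1"),
    ("model_v1", "wasGeneratedBy", "train_run_1"),
    ("eval_run_1", "used", "model_v1"),
    ("report_v1", "wasGeneratedBy", "eval_run_1"),
    ("dataset_v1", "wasDerivedFrom", "raw_sensor_dump"),
]


def build_graph(edges):
    """Index the triples as an adjacency list keyed by subject."""
    graph = defaultdict(list)
    for subject, relation, obj in edges:
        graph[subject].append((relation, obj))
    return graph


def lineage(graph, start):
    """Return all nodes that `start` depends on, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    # Trace everything the evaluation report ultimately depends on.
    print(lineage(GRAPH, "report_v1"))
```

Each step of the traversal is a lookup on the current node's outgoing edges. Answering the same question from flat logs would generally mean rescanning the records for every hop, which is one plausible reason graph storage suits this kind of query.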

Published
2025-09-29
BASTOS, Júlia Neumann; PORTO, Fabio; SIQUEIRA, Fábio Levy; GOMI, Edson; SANTOS, Ismael; BARREIRA, Rodrigo; SIQUEIRA, Isabela; OGASAWARA, Eduardo. Twinscie-Prov: Managing Provenance in the ML Lifecycle with Twinscie. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 576-588. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247286.