Twinscie-Prov: Managing Provenance over the ML Lifecycle in the Twinscie System
Abstract
The growing complexity of machine learning applications demands systems that guarantee traceability and reproducibility. This work presents Twinscie-Prov, an adaptation of the W3C PROV standard for structuring the provenance of data and processes throughout the Machine Learning (ML) lifecycle in the Twinscie system. Provenance data are stored in the Neo4j NoSQL system, which enables complex queries and auditing. Preliminary studies show that, for queries that navigate the dependency graph, which are typical of provenance data, the Neo4j implementation is up to five orders of magnitude faster than the log-based implementation.
Keywords:
Data Provenance, Machine Learning, Model Lifecycle, W3C PROV, Graph Database
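The dependency-graph navigation mentioned in the abstract can be illustrated with a minimal in-memory sketch. It is not the Twinscie-Prov schema or implementation: the node names (`model_v1`, `train_run`, and so on) and the helper functions `build_graph` and `lineage` are hypothetical, and only the relation names `used` and `wasGeneratedBy` come from W3C PROV. The sketch shows why lineage questions amount to transitive traversals, which a graph store can follow edge by edge while a flat log typically has to be rescanned at every hop.

```python
from collections import defaultdict

# Hypothetical provenance edges as (subject, relation, object) triples.
# The relation names follow W3C PROV; the node names are invented for illustration.
EDGES = [
    ("train_run", "used", "clean_dataset"),
    ("model_v1", "wasGeneratedBy", "train_run"),
    ("clean_dataset", "wasGeneratedBy", "preprocess_run"),
    ("preprocess_run", "used", "raw_dataset"),
]


def build_graph(edges):
    """Index the edges as an adjacency list pointing at upstream dependencies."""
    graph = defaultdict(list)
    for subject, _relation, obj in edges:
        graph[subject].append(obj)
    return graph


def lineage(graph, start):
    """Return every node reachable from `start`, i.e. its full upstream lineage."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    graph = build_graph(EDGES)
    # Everything the hypothetical model_v1 depends on, directly or indirectly.
    print(sorted(lineage(graph, "model_v1")))
```

In a graph database such as Neo4j, a traversal of this kind would typically be written as a variable-length path pattern in Cypher rather than as hand-written application code.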
Published
29/09/2025
How to Cite
BASTOS, Júlia Neumann; PORTO, Fabio; SIQUEIRA, Fábio Levy; GOMI, Edson; SANTOS, Ismael; BARREIRA, Rodrigo; SIQUEIRA, Isabela; OGASAWARA, Eduardo. Twinscie-Prov: Gerenciando a Proveniência sobre o Ciclo-de-vida de ML no Sistema Twinscie. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 576-588. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247286.
