Experiencing ProvLake to Manage the Data Lineage of AI Workflows

  • Leonardo Guerreiro Azevedo IBM Research
  • Renan Souza IBM Research
  • Raphael Melo Thiago IBM Research
  • Elton Soares IBM Research
  • Marcio Moreno IBM Research


Machine Learning (ML) is a core concept behind Artificial Intelligence systems, which are driven by data and generate ML models. These models are used for decision making, so it is crucial to be able to trust their outputs, e.g., by understanding the process that derives them. One way to explain the derivation of ML models is to track the whole ML lifecycle and generate its data lineage, which can be accomplished with provenance data management techniques. In this work, we present the use of the ProvLake tool for ML provenance data management in the ML lifecycle for Well Top Picking, an essential process in Oil and Gas exploration. We show how ProvLake supported the validation of ML models, the understanding of whether the ML models generalize while respecting the domain characteristics, and the understanding of their derivation.
Keywords: Machine Learning, Data Lineage, AI workflows
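
The idea of tracking an ML lifecycle to obtain the data lineage of a model can be illustrated with a minimal sketch. This is not the ProvLake API: the `Task` and `LineageStore` classes, the task names, and the data item names below are all hypothetical and chosen only for illustration. Each executed step records the data it used and the data it generated, and the lineage of an item is recovered by walking those records backwards.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """One executed workflow step with the data it used and generated."""
    name: str
    used: List[str]
    generated: List[str]


@dataclass
class LineageStore:
    """In-memory provenance store (hypothetical; not the ProvLake API)."""
    tasks: List[Task] = field(default_factory=list)

    def record(self, name: str, used: List[str], generated: List[str]) -> None:
        # Append one provenance record for an executed step.
        self.tasks.append(Task(name, list(used), list(generated)))

    def lineage(self, artifact: str) -> List[str]:
        """Return the names of the tasks that led to `artifact`, upstream first."""
        # Map each generated data item to the task that produced it.
        producers: Dict[str, Task] = {}
        for task in self.tasks:
            for item in task.generated:
                producers[item] = task

        ordered: List[str] = []
        seen = set()

        def visit(item: str) -> None:
            task = producers.get(item)
            if task is None or task.name in seen:
                return  # raw input, or task already visited
            seen.add(task.name)
            for parent in task.used:
                visit(parent)
            ordered.append(task.name)

        visit(artifact)
        return ordered


# Hypothetical steps of an ML lifecycle; all names are assumptions.
store = LineageStore()
store.record("curate_well_logs", used=["raw_well_logs"], generated=["curated_logs"])
store.record("train_model", used=["curated_logs", "hyperparameters"],
             generated=["well_top_model"])
store.record("evaluate_model", used=["well_top_model", "held_out_wells"],
             generated=["metrics"])

print(store.lineage("metrics"))
```

Asking for the lineage of `metrics` walks back through evaluation, training, and curation, which is the kind of derivation question a provenance system is meant to answer. A real system would also persist the records and capture parameters and timestamps.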


AZEVEDO, Leonardo Guerreiro; SOUZA, Renan; THIAGO, Raphael Melo; SOARES, Elton; MORENO, Marcio. Experiencing ProvLake to Manage the Data Lineage of AI Workflows. In: ENCONTRO DE INOVAÇÃO EM SISTEMAS DE INFORMAÇÃO - SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 16., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 206-209. DOI: https://doi.org/10.5753/sbsi.2020.13144.