TY  - JOUR
AU  - Gonçalves, João
AU  - de Oliveira, Daniel
AU  - Ocaña, Kary
AU  - Ogasawara, Eduardo
AU  - Dias, Jonas
AU  - Mattoso, Marta
PY  - 2013/06/07
Y2  - 2024/03/28
TI  - Performance Analysis of Data Filtering in Scientific Workflows
JF  - Journal of Information and Data Management
JA  - JIDM
VL  - 4
IS  - 1
SE  - SBBD 2012 Short Papers
DO  - 10.5753/jidm.2013.1466
UR  - https://sol.sbc.org.br/journals/index.php/jidm/article/view/1466
SP  - 17
AB  - A major issue during scientific workflow execution is how to manage the large volume of data to be processed. This issue is even more complex in cloud computing, where all resources are configurable in a pay-per-use model. A possible solution is to take advantage of the exploratory nature of the experiment and adopt filters to reduce the data flow between activities. While exploring the data, the scientist may discard superfluous data produced during workflow execution (data whose results do not comply with given quality criteria), avoiding unnecessary computations later. These quality criteria can be evaluated based on provenance and domain-specific data. We claim that the final decision on whether to discard superfluous data becomes feasible only when workflows can be steered by scientists at runtime using provenance data enriched with domain-specific data. In this article, we introduce Provenance Analyzer (PA), an approach for examining the quality of data during workflow execution by querying provenance. PA removes superfluous data, reducing the execution time of workflows that typically run for days or weeks. Our approach introduces a component that enables sophisticated provenance analysis to decide at runtime whether data should be propagated to the subsequent activities of the workflow. This is possible because PA relies on a data-centric workflow algebra, in which PA plays the role of a filter operator.
Scientists can change the filter criteria during workflow execution according to the behavior of the execution. Our experiments use a real phylogenetic analysis workflow on top of the SciCumulus parallel workflow cloud execution engine. Results show a data reduction of 23%, which led to performance improvements of up to 36.2% compared to a workflow without PA.
ER  -