Managing Large-Scale Scientific Experiments

  • Marta Mattoso UFRJ
  • Cláudia Werner UFRJ
  • Guilherme Horta Travassos UFRJ
  • Vanessa Braganholo UFRJ
  • Leonardo Murta UFRJ

Abstract


Several scientific areas, such as bioinformatics and oil engineering, need means of executing simulation-based experiments. In most cases, the state of the practice consists of executing a set of programs. This, however, is not enough to handle the complexity of the problems under analysis, and the issue worsens with large-scale experiments. In that case, we need a system that manages the composition of processes and data in a coherent flow. The system must also be capable of recording the steps and parameters used in successful executions of the experiment. The main motivation of this paper is to identify and analyze the challenges that must be addressed to provide computational support for the development of large-scale scientific experiments. The challenges we identify concern the general problem of managing scientific experiments across several applications and resources distributed over a large-scale network, such as a grid. We identify three complementary research directions: performance, the management process, and semantic support. For each of them, we point out some possible solution paths.

Published
2008-07-12
MATTOSO, Marta; WERNER, Cláudia; TRAVASSOS, Guilherme Horta; BRAGANHOLO, Vanessa; MURTA, Leonardo. Managing Large-Scale Scientific Experiments. In: INTEGRATED SOFTWARE AND HARDWARE SEMINAR (SEMISH), 35., 2008, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2008. p. 121-135. ISSN 2595-6205.