Evaluation of LLMs’ Capability to Specify Workflows
Abstract
This paper evaluates the use of Large Language Models (LLMs) for specifying workflows from natural language descriptions. Three LLMs (GPT-4o, DeepSeek V3, and Command-A), two prompt versions (with and without examples), and four workflow systems (Nextflow, Parsl, Dask, and Airflow) were compared across three levels of workflow complexity. The results indicate that prompts with examples produce specifications that are syntactically more correct and semantically better aligned with the natural language description, with notable performance from GPT-4o and Dask. Nevertheless, challenges remain in generating complex workflows, particularly those involving parallelism.
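The task evaluated here is turning a natural language description into executable workflow code. As a rough illustration of what such a specification looks like in one of the compared systems, the sketch below shows a hypothetical fan-out/fan-in pipeline written with Dask's delayed API; the task names and logic are illustrative assumptions, not the benchmarks used in the study.

# A minimal, hypothetical Dask workflow: extract -> parallel transforms -> aggregate.
# Names and logic are illustrative only; they do not come from the paper's benchmarks.
from dask import delayed

@delayed
def extract(source):
    # Stand-in for a data-ingestion step.
    return list(range(10))

@delayed
def transform(chunk, factor):
    # Stand-in for a per-branch computation; independent calls can run in parallel.
    return sum(x * factor for x in chunk)

@delayed
def aggregate(partials):
    # Stand-in for a final fan-in/aggregation step.
    return sum(partials)

if __name__ == "__main__":
    data = extract("input.csv")
    branches = [transform(data, factor) for factor in (1, 2, 3)]
    result = aggregate(branches)
    print(result.compute())  # Builds and executes the task graph.

In the setting described in the abstract, the LLM is asked to produce a specification of this kind directly from the textual description, and the output is then assessed for syntactic correctness and semantic alignment with that description.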
Keywords:
LLM, Workflows
References
Babuji, Y. N. et al. (2019). Parsl: Pervasive parallel programming in Python. In HPDC’19, pages 25–36. ACM.
Choi, H. K. and Li, Y. (2024). PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning. In PMLR’24, pages 8722–8739.
de Oliveira, D. et al. (2019). Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments. Morgan & Claypool Publishers.
Di Tommaso, P. et al. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4):316–319.
Dong, Q. et al. (2024). A Survey on In-context Learning. In Proc. of EMNLP’24, pages 1107–1128, Miami, Florida, USA. ACL.
Duque, A. et al. (2023). Leveraging large language models to build and execute computational workflows.
Koziolek, H. et al. (2024). LLM-based and retrieval-augmented control code generation. In LLM4Code ’24, pages 22–29, New York, NY, USA. ACM.
Rocklin, M. (2015). Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Proc. of the 14th Python in Science Conference, pages 126–132.
Paiva, L. et al. (2025). Domínio delimitado, ódio exposto: o uso de prompts para identificação de discurso de ódio online com LLMs [Bounded domain, exposed hate: using prompts to identify online hate speech with LLMs]. In SBBD’25, Fortaleza, Brazil.
Sänger, M. et al. (2024). A qualitative assessment of using ChatGPT as large language model for scientific workflow development. GigaScience, 13.
Vaswani, A. et al. (2017). Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS’22, Red Hook, NY, USA. Curran Associates Inc.
Xu, J. et al. (2024). LLM4Workflow: An LLM-based automated workflow model generation tool. In ASE’24, pages 2394–2398, New York, NY, USA. ACM.
Yildiz, O. and Peterka, T. (2025). Do large language models speak scientific workflows?
Zhang, X. et al. (2024). MASSW: A new dataset and benchmark tasks for AI-assisted scientific workflows. arXiv preprint arXiv:2406.06357.
Published
2025-09-29
How to Cite
WOYAMES, Paula; PINA, Débora; KUNSTMANN, Liliane; MATTOSO, Marta; DE OLIVEIRA, Daniel.
Evaluation of LLMs’ Capability to Specify Workflows. In: BRAZILIAN E-SCIENCE WORKSHOP (BRESCI), 19., 2025, Fortaleza/CE.
Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 81-88.
ISSN 2763-8774.
DOI: https://doi.org/10.5753/bresci.2025.248218.
