Search and Retrieval of Workflows in Repositories using Transformers and Topic Modeling

  • Lyncoln S. Oliveira, Federal University of Rio de Janeiro, http://orcid.org/0000-0002-0015-0709
  • Annie Amorim, Fluminense Federal University
  • Marcos Lage, Fluminense Federal University
  • Aline Paes, Fluminense Federal University
  • Daniel de Oliveira, Fluminense Federal University

Abstract


Because workflow modeling is inherently complex, various repositories provide pre-modeled workflows for reuse and adaptation. Although these repositories offer labeling mechanisms, labels are not always filled in, and when they are, their values can restrict what a search finds. An alternative is to search these repositories using natural language descriptions of workflows, rather than relying only on label-based search or on structural comparison of workflows, which may be infeasible. This paper presents Athena++, an approach that uses natural language processing techniques, specifically Transformers and Topic Modeling, to search for workflows in repositories. Athena++ was evaluated on a set of workflows obtained from the Galaxy repository, and the results were promising.
Keywords: scientific workflows, workflow retrieval, transformers, topic modeling

Published
2024-10-14
OLIVEIRA, Lyncoln S.; AMORIM, Annie; LAGE, Marcos; PAES, Aline; OLIVEIRA, Daniel de. Search and Retrieval of Workflows in Repositories using Transformers and Topic Modeling. In: BRAZILIAN E-SCIENCE WORKSHOP (BRESCI), 18., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 40-47. ISSN 2763-8774. DOI: https://doi.org/10.5753/bresci.2024.243907.