Proposal for comparison and measurement of parallel and distributed file systems for training ML models in the healthcare

  • João V. Vargas UFRGS
  • Cristiano A. Künas UFRGS
  • Thiago Araújo UFRGS
  • Bruno Morales UFRGS
  • Philippe O. A. Navaux UFRGS

Abstract


Many scientific fields increasingly rely on high-performance computing (HPC) to process and analyze vast amounts of experimental data. Storage systems in modern HPC environments must therefore adapt to different access patterns: frequent metadata operations, numerous small I/O requests, and randomized file access. Traditional parallel file systems, by contrast, have been optimized primarily for sequential and shared access to large files. In this work, we will evaluate the performance of GekkoFS against Lustre, a widely used parallel file system that meets the demanding requirements of HPC simulation environments. The comparison aims to highlight the strengths and limitations of each system for training machine learning (ML) models in healthcare.

Keywords: Machine Learning and High Performance Computing, High Performance File Systems and Input/Output
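
The two access patterns contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' benchmark: the function names, file counts, and file sizes are hypothetical, and it uses only the Python standard library on an arbitrary directory. Pointing `run` at a directory on a Lustre or GekkoFS mount is an assumption about how such a probe might be used. Because the files are read right after being written, timings on a local disk mostly reflect the page cache.

```python
import os
import random
import tempfile
import time


def make_dataset(root, n_files=200, size=4096):
    """Create n_files small files of `size` random bytes under root."""
    paths = []
    for i in range(n_files):
        path = os.path.join(root, f"sample_{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths


def random_small_reads(paths, seed=0):
    """Visit files in shuffled order: one stat (metadata) and one small read each."""
    order = list(paths)
    random.Random(seed).shuffle(order)
    total = 0
    start = time.perf_counter()
    for path in order:
        os.stat(path)  # metadata operation
        with open(path, "rb") as f:
            total += len(f.read())  # small I/O request
    return total, time.perf_counter() - start


def sequential_large_read(path, block=1 << 20):
    """Read one large file front to back in fixed-size blocks."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            total += len(chunk)
    return total, time.perf_counter() - start


def run(root):
    """Time both patterns over the same bytes stored in two layouts."""
    paths = make_dataset(root)
    big = os.path.join(root, "combined.bin")
    with open(big, "wb") as out:  # same data, packed into one large file
        for p in paths:
            with open(p, "rb") as f:
                out.write(f.read())
    small_bytes, small_s = random_small_reads(paths)
    large_bytes, large_s = sequential_large_read(big)
    return {
        "small_bytes": small_bytes,
        "small_seconds": small_s,
        "large_bytes": large_bytes,
        "large_seconds": large_s,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run(tmp))
```

Both layouts hold the same number of bytes, so any difference in elapsed time comes from the access pattern (many opens, stats, and small reads versus one streamed read) rather than from data volume.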

Published
2025-04-23
VARGAS, João V.; KÜNAS, Cristiano A.; ARAÚJO, Thiago; MORALES, Bruno; NAVAUX, Philippe O. A. Proposal for comparison and measurement of parallel and distributed file systems for training ML models in the healthcare. In: REGIONAL SCHOOL OF HIGH PERFORMANCE COMPUTING FROM SOUTHERN BRAZIL (ERAD-RS), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 105-108. ISSN 2595-4164. DOI: https://doi.org/10.5753/eradrs.2025.6807.