Go as a Training and Serving Orchestrator for Transformer Workloads

Daniel Pontes; Murilo Salem; Marcos Alves; Karen Ono; Gerson Geraldo H. Cavalheiro

doi:10.5753/eradrs.2026.21496

Daniel Pontes UFPel
Murilo Salem UFPel
Marcos Alves UFPel
Karen Ono UFPel
Gerson Geraldo H. Cavalheiro UFPel

Abstract

This paper describes the current operational state of a Go orchestrator for Transformer workloads in the neural-lm-hpc project. Go handles configuration, training control, checkpointing, gRPC serving, and Prometheus observability, while numerical execution remains confined to a Rust engine. The main contribution is a loosely coupled orchestration layer that preserves the model's original configuration, isolates the FFI (Foreign Function Interface) boundary behind small interfaces, and makes serving testable independently of the numerical engine. The result is a practical foundation for short experimentation cycles.
