Rust as an Engine for Transformers: Stable FFI, Benchmarks, and Device Control

Daniel Pontes; Murilo Salem; Marcos Alves; Karen Satie; Gerson Geraldo H. Cavalheiro

doi:10.5753/eradrs.2026.21497

Daniel Pontes UFPel
Murilo Salem UFPel
Marcos Alves UFPel
Karen Satie UFPel
Gerson Geraldo H. Cavalheiro UFPel

Abstract

This paper presents the Rust engine used by the neural-lm-hpc project to run Transformer workloads. The engine concentrates tensor storage, the model, the tokenizer, the optimizer, the FFI exports, and the CPU/CUDA backends in one place, while reproducible benchmark scripts reuse a simple profiling runner. The main contributions are a runtime with explicit device control, a stable C ABI consumed by Go, and operational scripts for measuring throughput, latency, and memory. CPU-only results show the expected cost of scaling from 125M to 1.3B parameters and establish a reproducible baseline for future CUDA measurements.
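As a rough illustration of what a stable C ABI with explicit device control can look like, the following is a minimal sketch and not the project's actual interface. Every name in it (`Device`, `Engine`, `engine_new`, `engine_device`, `engine_free`) is hypothetical, as are the integer device codes. It assumes the caller (for example Go through cgo) holds only an opaque pointer and selects the device with a plain integer.

```rust
// Hypothetical sketch: none of these names come from neural-lm-hpc.
// In a real cdylib each exported function would also carry `#[no_mangle]`
// (`#[unsafe(no_mangle)]` in the 2024 edition) so the symbol name stays stable.
use std::os::raw::c_int;

/// Device the engine runs on. The numeric codes are an assumption of this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Device {
    Cpu = 0,
    Cuda = 1,
}

/// Opaque engine state. Foreign callers only ever hold a pointer to it.
pub struct Engine {
    device: Device,
}

/// Creates an engine on the requested device.
/// Takes a plain integer rather than the enum, so an invalid code coming
/// from the foreign side is rejected with a null pointer instead of being
/// undefined behavior.
pub extern "C" fn engine_new(device: c_int) -> *mut Engine {
    let device = match device {
        0 => Device::Cpu,
        1 => Device::Cuda,
        _ => return std::ptr::null_mut(),
    };
    Box::into_raw(Box::new(Engine { device }))
}

/// Returns the device code of the engine, or -1 for a null pointer.
pub extern "C" fn engine_device(engine: *const Engine) -> c_int {
    if engine.is_null() {
        return -1;
    }
    // The pointer is assumed to come from `engine_new` and to be still alive.
    unsafe { (*engine).device as c_int }
}

/// Releases an engine created by `engine_new`. A null pointer is ignored.
pub extern "C" fn engine_free(engine: *mut Engine) {
    if !engine.is_null() {
        // Rebuild the Box so Rust drops the allocation it originally made.
        unsafe { drop(Box::from_raw(engine)) };
    }
}
```

In this sketch the device is chosen once, at construction, and can be queried afterwards, so the caller never depends on an implicit global default. Ownership stays on the Rust side: memory allocated by `engine_new` is only released through `engine_free`.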
