MoLang: Leveraging General-Purpose Language Models for Human Animation
Abstract
Virtual reality (VR) applications and interactive media increasingly demand methods for generating human motion that is both realistic and controllable. This paper presents MotionLLM, a framework under development for text-to-motion synthesis that leverages the power of Large Language Models (LLMs). Our approach first tokenizes continuous 3D motion into a discrete sequence using a Residual Vector Quantized Variational Autoencoder (RQ-VAE), adapting the tokenization strategy of MoMask. We then reformulate motion generation as an autoregressive language modeling task, in which a pre-trained LLM generates motion tokens conditioned on text. Our hypothesis is that LLMs are particularly well suited to producing long, coherent motion sequences, offering a scalable architecture and enabling multilingual and multimodal control.
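The pipeline described above has two stages: residual vector quantization of motion latents into discrete tokens, and autoregressive prediction of those tokens conditioned on a text prompt. The sketch below is illustrative only; it is not the MotionLLM implementation, and all module names, layer sizes, and the stand-in text conditioning are assumptions made for this example (PyTorch).

import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    # Quantize a latent sequence with a stack of codebooks, each coding the residual
    # left by the previous layer (straight-through gradients and the codebook and
    # commitment losses used during training are omitted in this sketch).
    def __init__(self, num_layers=4, codebook_size=512, dim=128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_layers))

    def forward(self, z):                                    # z: (batch, frames, dim)
        residual, quantized, indices = z, torch.zeros_like(z), []
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook.weight.unsqueeze(0))  # (B, T, K)
            idx = dists.argmin(dim=-1)                                   # (B, T)
            q = codebook(idx)
            quantized = quantized + q
            residual = residual - q
            indices.append(idx)
        return quantized, torch.stack(indices, dim=-1)       # (B, T, num_layers)

class MotionTokenLM(nn.Module):
    # Tiny decoder-only stand-in for the pre-trained LLM: an encoded text prompt is
    # used as a prefix and motion tokens are predicted autoregressively after it.
    def __init__(self, vocab_size=512, dim=128, num_heads=4, num_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, num_heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, text_prefix, motion_tokens):
        # text_prefix: (B, P, dim) encoded prompt; motion_tokens: (B, T) token ids
        x = torch.cat([text_prefix, self.token_emb(motion_tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        return self.head(h[:, text_prefix.size(1):])         # logits for motion positions

# Toy usage: tokenize a fake motion latent, then compute a next-token prediction loss.
vq, lm = ResidualVQ(), MotionTokenLM()
latent = torch.randn(1, 60, 128)                  # 60 frames of motion features
_, token_ids = vq(latent)
base_tokens = token_ids[..., 0]                   # coarsest quantization layer
prompt = torch.randn(1, 8, 128)                   # stand-in for an encoded text prompt
logits = lm(prompt, base_tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), base_tokens[:, 1:].reshape(-1))
print(logits.shape, loss.item())

In the full method the residual token layers beyond the coarsest one would also be modeled (as in MoMask's residual stages), and the text prefix would come from the LLM's own tokenizer and embeddings rather than a random tensor.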
References
Chen, R., Wang, Z., Jiang, J., Wu, Z., Liu, X., Song, C.-Z., and Liu, F. (2024). Diffsheg: A diffusion-based method for parameterized sign language production. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Gong, Y., Zhao, Z., Zhang, J., Wang, S., Zhu, W., Chen, X., Ma, C., Liu, M., Xu, C., Wen, J., Wu, Y., Chen, C., Yang, J., Jiang, T., Liu, H., Ma, X., and Ci, H. (2023). Text-driven motion generation: Overview, challenges and directions. arXiv preprint arXiv:2305.09379.
Guo, C., Mu, Y., Javed, M. G., Wang, S., and Cheng, L. (2024). Momask: Generative masked modeling of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910.
Guo, C., Wang, X., Zou, S., Zuo, Y., Wang, S., Wu, W., Li, G., and Salsbury, J. K. (2023). Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems (NeurIPS), 36.
Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., and Cheng, L. (2022). Generating diverse and natural 3d human motions from text. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5142–5151.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.
Jiang, M., Tang, S., Jin, Z., Liu, Z., and Liu, W. (2023). Finemogen: Fine-grained motion generation and editing with spatio-temporal mixture attention. In Advances in Neural Information Processing Systems (NeurIPS), volume 36.
van den Oord, A., Vinyals, O., et al. (2017). Neural discrete representation learning. Advances in Neural Information Processing Systems, 30.
Wu, A., Zhang, K.-Y., Zhang, T.-L., Pan, J.-J., and Zhang, X.-H. (2024). Motioncraft: A unified framework for controllable human motion generation. arXiv preprint arXiv:2403.11186.
Yun, H., Ponton, J. L., Andujar, C., and Pelechano, N. (2023). Animation fidelity in self-avatars: Impact on user performance and sense of agency. In 2023 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 286–296. IEEE.
Zhang, M., Jiang, Z., Liu, S., Zhou, A., Wang, S., and Zhao, Y. (2022). Motiondiffuse: Text-driven human motion generation with diffusion model. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2516–2525.
Zhao, J., Weng, D., Du, Q., and Tian, Z. (2024). Motion generation review: Exploring deep learning for lifelike animation with manifold.
Zhu, W., Ma, X., Ro, D., Ci, H., Zhang, J., Shi, J., Gao, F., Tian, Q., and Wang, Y. (2024). Human motion generation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(4):2430–2449.
