Synthesizing Realistic Human Dance Motions Conditioned by Musical Data using Graph Convolutional Networks

Abstract


Learning to move naturally to music, i.e., to dance, is one of the most complex motor tasks humans perform, yet one we often carry out effortlessly. Existing automatic dance generation techniques built on classical CNN and RNN models suffer from training and motion-variability issues due to the non-Euclidean geometry of the motion manifold. We propose a novel method based on Graph Convolutional Networks (GCNs) to tackle the problem of automatic dance generation from audio. Our method uses an adversarial learning scheme conditioned on the input music audio to create natural motions. The results demonstrate that the proposed GCN model outperforms the state of the art in different experiments. Visual results of the motion generation, along with an explanation of the method, are available at: http://youtu.be/fGDK6UkKzvA
Keywords: Human motion generation, Sound and dance processing, Multimodal learning, Conditional adversarial nets, Graph convolutional neural networks
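
To make the generation pipeline concrete, the following is a minimal single-frame sketch, in PyTorch, of a graph-convolutional generator conditioned on an audio embedding. It is not the authors' implementation: the joint count, feature dimensions, the audio_feat vector (standing in for the output of a pretrained audio encoder), and the toy chain skeleton are all illustrative assumptions, and the temporal modeling needed for full motion sequences is omitted.

import torch
import torch.nn as nn

def normalized_adjacency(edges, n_joints):
    """Build D^{-1/2} (A + I) D^{-1/2}: symmetrically normalized
    skeleton adjacency with self-loops (standard GCN propagation)."""
    a = torch.eye(n_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt

class GraphConv(nn.Module):
    """One spatial graph convolution: X' = A_hat X W."""
    def __init__(self, in_feat, out_feat, adj):
        super().__init__()
        self.register_buffer("adj", adj)   # fixed normalized adjacency, shape (V, V)
        self.linear = nn.Linear(in_feat, out_feat)

    def forward(self, x):                  # x: (batch, V, in_feat)
        return self.linear(self.adj @ x)

class DanceGenerator(nn.Module):
    """Maps a noise vector plus an audio embedding to one 2D pose.
    Conditioning: the concatenated (noise, audio) vector is projected
    and broadcast to every joint before the graph convolutions.
    All sizes here are hypothetical, not the paper's."""
    def __init__(self, adj, n_joints, latent_dim=32, audio_dim=128, hidden=64):
        super().__init__()
        self.n_joints = n_joints
        self.proj = nn.Linear(latent_dim + audio_dim, hidden)
        self.gcn1 = GraphConv(hidden, hidden, adj)
        self.gcn2 = GraphConv(hidden, 2, adj)    # output: (x, y) per joint

    def forward(self, z, audio_feat):
        cond = torch.cat([z, audio_feat], dim=-1)   # (B, latent+audio)
        h = self.proj(cond).unsqueeze(1)            # (B, 1, hidden)
        h = h.expand(-1, self.n_joints, -1)         # copy conditioning to all joints
        h = torch.relu(self.gcn1(h))
        return self.gcn2(h)                         # (B, V, 2) pose

# Toy usage with a hypothetical 5-joint chain skeleton.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
gen = DanceGenerator(normalized_adjacency(edges, 5), n_joints=5)
pose = gen(torch.randn(4, 32), torch.randn(4, 128))
print(pose.shape)   # torch.Size([4, 5, 2])

In the full adversarial scheme, a discriminator operating on the same skeleton graph would score generated poses against real dance motions, and temporal convolutions over consecutive frames would extend this single-frame sketch to whole sequences.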

Published
18/07/2021
How to Cite

FERREIRA, João P. M.; MARTINS, Renato; NASCIMENTO, Erickson R. Synthesizing Realistic Human Dance Motions Conditioned by Musical Data using Graph Convolutional Networks. In: CONCURSO DE TESES E DISSERTAÇÕES (CTD), 34., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 79-84. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2021.15762.