Synthesizing Realistic Human Dance Motions Conditioned by Musical Data using Graph Convolutional Networks
Abstract
Learning to move naturally to music, i.e., to dance, is one of the most complex motor skills that humans nonetheless often perform effortlessly. Synthesizing human motion through learning techniques is becoming an increasingly popular approach to alleviating the need for new motion capture whenever an animation must be produced. Most approaches that address automatic dance motion synthesis with classical convolutional and recurrent neural models suffer from training and variability issues caused by the non-Euclidean geometry of the motion manifold. In this thesis, we design a novel method based on graph convolutional networks (GCNs) that overcomes these issues and tackles the problem of automatic dance generation from audio information. Our method uses an adversarial learning scheme conditioned on the input music to create natural motions that preserve the key movements of different music styles. We also collected, annotated, and made publicly available a novel multimodal dataset with paired audio, motion data, and videos of people dancing three different music styles, providing common ground for evaluating dance generation approaches. The results suggest that the proposed GCN model outperforms the state-of-the-art music-conditioned dance generation method in different experiments. Moreover, our graph-convolutional approach is simpler, easier to train, and generates more realistic motion styles according to both qualitative assessments and different quantitative metrics. It also achieves a perceptual quality of movement comparable to that of real motion data. The dataset, source code, and qualitative results are available on the project's webpage: https://verlab.github.io/Learning2Dance_CAG_2020/.
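To make the high-level description above concrete, the minimal PyTorch sketch below illustrates the two ingredients the abstract refers to: graph convolutions over the skeleton graph (in the spirit of Kipf and Welling) and a conditional adversarial training step driven by music features (in the spirit of conditional GANs). All class names, layer sizes, the single-frame output, and the placeholder audio features and adjacency matrix are illustrative assumptions for exposition only; the method developed in this thesis operates on temporal pose sequences and uses its own audio encoder and network design.

```python
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """One graph convolution over the skeleton graph: X' = act(A_hat X W)."""

    def __init__(self, in_feats, out_feats, adjacency, act=True):
        super().__init__()
        self.register_buffer("a_hat", adjacency)            # normalized (J x J) adjacency
        self.linear = nn.Linear(in_feats, out_feats)
        self.act = nn.ReLU() if act else nn.Identity()

    def forward(self, x):                                    # x: (batch, joints, in_feats)
        return self.act(self.a_hat @ self.linear(x))


class PoseGenerator(nn.Module):
    """Maps an audio feature vector plus noise to one 2D pose (joints x 2)."""

    def __init__(self, adjacency, num_joints=25, audio_dim=128, noise_dim=32):
        super().__init__()
        self.num_joints = num_joints
        self.expand = nn.Linear(audio_dim + noise_dim, num_joints * 16)
        self.gcn1 = GraphConv(16, 16, adjacency)
        self.gcn2 = GraphConv(16, 2, adjacency, act=False)   # (x, y) per joint

    def forward(self, audio_feat, noise):
        h = self.expand(torch.cat([audio_feat, noise], dim=-1))
        h = h.view(-1, self.num_joints, 16)
        return self.gcn2(self.gcn1(h))


class PoseDiscriminator(nn.Module):
    """Scores pose/audio pairs, following the conditional-GAN idea."""

    def __init__(self, num_joints=25, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2 + audio_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, pose, audio_feat):
        return self.net(torch.cat([pose.flatten(1), audio_feat], dim=-1))


if __name__ == "__main__":
    J = 25
    # Identity adjacency as a placeholder; a real skeleton graph with normalized
    # edges, e.g. D^{-1/2}(A + I)D^{-1/2}, would be used instead.
    a_hat = torch.eye(J)
    gen, disc = PoseGenerator(a_hat), PoseDiscriminator()
    opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    audio = torch.randn(8, 128)       # stand-in audio features (a real encoder is assumed)
    real_pose = torch.randn(8, J, 2)  # stand-in ground-truth 2D poses

    # Discriminator step: push real (pose, audio) pairs toward 1, generated pairs toward 0.
    fake = gen(audio, torch.randn(8, 32)).detach()
    loss_d = bce(disc(real_pose, audio), torch.ones(8, 1)) + \
             bce(disc(fake, audio), torch.zeros(8, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: produce music-conditioned poses that fool the discriminator.
    fake = gen(audio, torch.randn(8, 32))
    loss_g = bce(disc(fake, audio), torch.ones(8, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In this sketch the conditioning signal (the audio feature vector) is fed to both the generator and the discriminator, which is what ties the synthesized motion to the music style; extending the output from a single pose to a sequence is where the spatial-temporal graph structure of the actual model comes in.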