Transferring Human Motion and Appearance in Monocular Videos
This thesis investigates the problem of transferring human motion and appearance from video to video while preserving motion features, body shape, and visual quality. In other words, given two input videos, we investigate how to synthesize a new video where the target person from the first video is placed into a new context, performing the motions from the second video. Potential application domains include graphics animation and entertainment media that rely on synthetic characters and virtual environments to create visual content. We introduce two novel methods for transferring appearance and retargeting human motion from monocular videos, thereby expanding the creative possibilities for visual content. Unlike recent appearance transfer methods, our approaches take into account 3D shape, appearance, and motion constraints. Specifically, our first method is based on a hybrid image-based rendering technique that achieves visual retargeting quality competitive with state-of-the-art neural rendering approaches, without requiring computationally intensive training. Then, building on the strengths of the first method, we designed an end-to-end learning-based transfer strategy. By taking advantage of both differentiable rendering and a 3D parametric body model, our second, data-driven method produces a fully controllable 3D human model, i.e., the user can control both the human pose and the rendering parameters. Experiments on different videos show that our methods preserve specific motion features that must be maintained (e.g., feet touching the floor, hands touching a particular object) while achieving the best appearance scores in terms of Structural Similarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), Mean Squared Error (MSE), and Fréchet Video Distance (FVD).
We also provide the community with a new dataset composed of several annotated videos with motion constraints for retargeting applications, together with paired motion sequences from different characters for evaluating transfer approaches.
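The appearance metrics named above can be illustrated with a minimal sketch. The snippet below computes MSE and a simplified, single-window SSIM between two frames; note that the standard SSIM of Wang et al. averages this statistic over sliding local windows (libraries such as scikit-image and lpips provide the reference implementations), and the frame size and the [0, 1] value range here are assumptions for illustration only.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean Squared Error between two frames (lower is better)."""
    return float(np.mean((a - b) ** 2))

def global_ssim(a: np.ndarray, b: np.ndarray,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Single-window SSIM over the whole frame, assuming values in [0, 1].

    The standard metric averages this statistic over local windows;
    computing it globally keeps the sketch short.
    """
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)

# Toy comparison: a synthetic "ground-truth" frame vs. a corrupted copy.
rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))                # hypothetical video frame
corrupted = np.clip(frame + 0.1 * rng.random(frame.shape), 0.0, 1.0)

print(mse(frame, frame), global_ssim(frame, frame))          # 0.0 and 1.0
print(mse(frame, corrupted), global_ssim(frame, corrupted))  # worse scores
```

An identical pair yields the ideal scores (MSE 0, SSIM 1), while any difference in mean or structure pushes SSIM below 1 and MSE above 0, which is how the retargeted frames are ranked against the ground-truth video.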