LLM-Generated Dataset for Speech-Driven 3D Facial Animation Models with Text-Controlled Expressivity
Abstract
The field of speech-driven 3D facial animation faces a fundamental limitation in natural-language style control: the scarcity of datasets that pair nuanced language with direct, controllable facial parameters. To address this bottleneck, we introduce two core contributions: (1) a large-scale, synthetically generated dataset of over 53,000 items that link emotional transcripts, descriptive tags, and explicit facial descriptions to 51-dimensional blendshape vectors; and (2) a CLIP-based alignment module that learns a shared semantic manifold between this text and the corresponding expressions. Our model employs a Transformer-based blendshape autoencoder and a frozen CLIP text encoder, optimized via a cross-modal objective. We evaluate the structure of the learned manifold using t-SNE projections and latent space traversals, which demonstrate semantically coherent clustering and smooth, plausible transitions in expression. We release our dataset specification, generation methodology, and training recipe to facilitate reproducible research in text-driven facial animation.
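To make the described pipeline concrete, below is a minimal illustrative sketch (in PyTorch) of the kind of cross-modal alignment the abstract describes: a small Transformer-based autoencoder over 51-dimensional blendshape vectors whose latent is aligned with frozen CLIP text embeddings through a symmetric contrastive objective. All module names, layer sizes, the 512-dimensional text embedding, and the temperature are assumptions made for illustration; this is not the authors' implementation.

# Illustrative sketch (not the authors' code): aligning frozen CLIP text
# embeddings with the latent of a Transformer-based blendshape autoencoder
# via a symmetric, CLIP-style contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendshapeAutoencoder(nn.Module):
    """Encodes 51-d blendshape vectors into a latent and decodes them back."""
    def __init__(self, n_blendshapes=51, d_model=128, latent_dim=512, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(n_blendshapes, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.to_latent = nn.Linear(d_model, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, n_blendshapes),
        )

    def forward(self, x):                                   # x: (B, 51)
        h = self.encoder(self.in_proj(x).unsqueeze(1)).squeeze(1)  # (B, d_model)
        z = self.to_latent(h)                                # (B, latent_dim)
        recon = self.decoder(z)                              # (B, 51)
        return z, recon

def clip_style_alignment_loss(text_emb, expr_emb, temperature=0.07):
    """Symmetric InfoNCE between frozen text embeddings and expression latents."""
    t = F.normalize(text_emb, dim=-1)
    e = F.normalize(expr_emb, dim=-1)
    logits = t @ e.t() / temperature                         # (B, B) similarity logits
    targets = torch.arange(t.size(0), device=t.device)       # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy training step: text_emb would come from a frozen CLIP text encoder
# (e.g. 512-d ViT-B/32 embeddings); here a random stand-in is used.
model = BlendshapeAutoencoder()
blendshapes = torch.rand(8, 51)                              # batch of blendshape vectors
text_emb = torch.randn(8, 512)                               # placeholder frozen CLIP embeddings
z, recon = model(blendshapes)
loss = clip_style_alignment_loss(text_emb, z) + F.mse_loss(recon, blendshapes)
loss.backward()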
Published
30/09/2025
How to Cite
CORRÊA, Pedro Rodrigues; COSTA, Paula Dornhofer Paro. LLM-Generated Dataset for Speech-Driven 3D Facial Animation Models with Text-Controlled Expressivity. In: WORKSHOP ON VIRTUAL HUMANS - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 331-334.
