“Disentangled Phoneme-Prosody Mapping for Controllable 3D Facial Animation” by Serrano and Musialski
Title:
- Disentangled Phoneme-Prosody Mapping for Controllable 3D Facial Animation
Session/Category Title:
- Animation & Simulation
Presenter(s)/Author(s):
- Serrano and Musialski
Abstract:
Speech-driven 3D facial animation from disentangled phoneme and prosody features, enabling fine-grained and intuitive control over visemes and expressions. The method uses a convolutional autoencoder to learn a relative motion prior and a transformer to map these interpretable audio features into latent deformations.
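The listing itself contains no code; the following is a minimal PyTorch sketch of the pipeline the abstract describes, purely as an illustration. All module names, layer sizes, the mesh resolution, and the additive fusion of the two feature streams are assumptions not stated in the abstract; the choice of a phoneme posteriorgram plus pitch/periodicity-style prosody curves as inputs follows the spirit of references [3] and [8] below.

```python
# Minimal sketch of the described pipeline; all architectural details
# here are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class MotionPriorAE(nn.Module):
    """Convolutional autoencoder over per-frame vertex offsets (motion
    relative to a neutral template), learning a latent motion prior.
    n_verts=5023 assumes a FLAME-style mesh as in VOCA [4]; the actual
    topology is not given in the listing."""

    def __init__(self, n_verts=5023, latent=128):
        super().__init__()
        d = n_verts * 3
        self.enc = nn.Sequential(               # temporal 1D convs over frames
            nn.Conv1d(d, 512, 5, padding=2), nn.GELU(),
            nn.Conv1d(512, latent, 5, padding=2),
        )
        self.dec = nn.Sequential(
            nn.Conv1d(latent, 512, 5, padding=2), nn.GELU(),
            nn.Conv1d(512, d, 5, padding=2),
        )

    def forward(self, offsets):                 # offsets: (B, T, V*3)
        z = self.enc(offsets.transpose(1, 2))   # (B, latent, T)
        recon = self.dec(z).transpose(1, 2)     # (B, T, V*3)
        return recon, z.transpose(1, 2)


class PhonemeProsodyMapper(nn.Module):
    """Transformer mapping interpretable audio features (a phoneme
    posteriorgram plus prosody curves such as pitch, periodicity, and
    loudness) into the autoencoder's latent deformation space."""

    def __init__(self, n_phonemes=40, n_prosody=3, latent=128, width=256):
        super().__init__()
        self.proj_ph = nn.Linear(n_phonemes, width)  # viseme-controlling stream
        self.proj_pr = nn.Linear(n_prosody, width)   # expression-controlling stream
        layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(width, latent)

    def forward(self, phonemes, prosody):  # (B, T, n_phonemes), (B, T, n_prosody)
        # Keeping the streams separate up to a sum lets either one be
        # edited independently at inference (e.g. swap prosody, keep visemes).
        h = self.proj_ph(phonemes) + self.proj_pr(prosody)
        return self.head(self.backbone(h))  # (B, T, latent)


if __name__ == "__main__":
    B, T = 2, 60
    ae, mapper = MotionPriorAE(), PhonemeProsodyMapper()
    z = mapper(torch.rand(B, T, 40), torch.rand(B, T, 3))
    # Decode predicted latents with the (conceptually frozen) motion decoder.
    offsets = ae.dec(z.transpose(1, 2)).transpose(1, 2)  # (B, T, 5023 * 3)
    print(offsets.shape)
```

The sum-based fusion is just one simple way to keep phoneme and prosody independently editable at inference time; how the paper actually enforces disentanglement is not specified in this listing.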
References:
[1] Shivangi Aneja, Artem Sevastopolsky, Tobias Kirschstein, Justus Thies, Angela Dai, and Matthias Nießner. 2024. GaussianSpeech: Audio-Driven Gaussian Avatars. arXiv:2411.18675 [cs.CV]. https://arxiv.org/abs/2411.18675
[2] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
[3] Cameron Churchwell, Max Morrison, and Bryan Pardo. 2024. High-Fidelity Neural Phonetic Posteriorgrams. arXiv:2402.17735 [eess.AS]. https://arxiv.org/abs/2402.17735
[4] Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. 2019. Capture, Learning, and Synthesis of 3D Speaking Styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10101–10111. http://voca.is.tue.mpg.de/
[5] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv preprint arXiv:2106.07447 (2021).
[6] Zeynep Inanoglu and Steve J. Young. 2007. A system for transforming the emotion in speech: combining data-driven conversion techniques for prosody and voice quality. In INTERSPEECH. 490–493.
[7] Max Morrison, Cameron Churchwell, Nathan Pruyne, and Bryan Pardo. 2024a. Fine-Grained and Interpretable Neural Speech Editing. arXiv:2407.05471 [eess.AS]. https://arxiv.org/abs/2407.05471
[8] Max Morrison, Caedon Hsieh, Nathan Pruyne, and Bryan Pardo. 2024b. Cross-domain Neural Pitch and Periodicity Estimation. arXiv:2301.12258 [eess.AS]. https://arxiv.org/abs/2301.12258
[9] Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, and Lan Xu. 2024. Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance. arXiv:2401.15687 [cs.CV]. https://arxiv.org/abs/2401.15687