“VisemeNet: audio-driven animator-centric speech animation” by Zhou, Xu, Landreth, Kalogerakis, Maji, et al. …

  • ©Yang Zhou, Zhan Xu, Christopher (Chris) Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh



Entry Number: 161

Session Title: Portraits & Speech


    VisemeNet: audio-driven animator-centric speech animation




    We present a novel deep-learning-based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face rig directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles like mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion curve profiles. Our contribution is an automatic, real-time solution for lip synchronization from audio that integrates seamlessly into existing animation pipelines. We evaluate our results by cross-validation against ground-truth data; animator critique and edits; visual comparison to recent deep-learning lip-synchronization solutions; and showing our approach to be resilient to diversity in speaker and language.


