Visemenet: audio-driven animator-centric speech animation

We present a novel deep-learning-based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face rig, directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles like mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion curve profiles. Our contribution is an automatic, real-time solution for lip synchronization from audio that integrates seamlessly into existing animation pipelines. We evaluate our results by cross-validation against ground-truth data; animator critique and edits; visual comparison to recent deep-learning lip-synchronization solutions; and a demonstration that our approach is resilient to diversity in speaker and language.
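
The staged data flow described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' network: the `TinyLSTM` class, the `sketch_pipeline` function, all layer sizes, and the exact wiring between stages are hypothetical, and the weights are random rather than trained. It shows one recurrent stage per insight: phonetic groups from audio, landmark motion from audio, and viseme curves from both.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM with random, untrained weights (illustration only)."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One weight row per gate unit; gates are stacked as
        # input, forget, cell candidate, output.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden + 1)]
                  for _ in range(4 * n_hidden)]

    def run(self, frames):
        """Map a sequence of feature vectors to a sequence of hidden states."""
        n = self.n_hidden
        h = [0.0] * n
        c = [0.0] * n
        outputs = []
        for x in frames:
            z = list(x) + h + [1.0]  # input, previous hidden state, bias term
            pre = [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w]
            i = [sigmoid(v) for v in pre[0:n]]
            f = [sigmoid(v) for v in pre[n:2 * n]]
            g = [math.tanh(v) for v in pre[2 * n:3 * n]]
            o = [sigmoid(v) for v in pre[3 * n:4 * n]]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


def sketch_pipeline(audio_frames, n_groups=4, n_landmarks=6, n_visemes=5, seed=0):
    """Hypothetical three-stage flow: audio -> groups, landmarks -> viseme curves."""
    rng = random.Random(seed)
    n_audio = len(audio_frames[0])
    # Stage A (assumed): per-frame phonetic-group scores from audio features.
    groups = TinyLSTM(n_audio, n_groups, rng).run(audio_frames)
    # Stage B (assumed): per-frame landmark motion, standing in for speech style.
    landmarks = TinyLSTM(n_audio, n_landmarks, rng).run(audio_frames)
    # Stage C (assumed): viseme motion curves from audio plus both earlier outputs.
    joint = [a + g + l for a, g, l in zip(audio_frames, groups, landmarks)]
    curves = TinyLSTM(n_audio + n_groups + n_landmarks, n_visemes, rng).run(joint)
    return curves


# Ten frames of three made-up audio features each.
frames = [[math.sin(0.3 * t), math.cos(0.2 * t), 0.1 * t] for t in range(10)]
curves = sketch_pipeline(frames)
```

Each entry of `curves` is one frame of hypothetical viseme activations. In a production setting these would be mapped onto rig controls, which the sketch does not attempt.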

SIGGRAPH 2018: Technical Papers

“Visemenet: audio-driven animator-centric speech animation” by Zhou, Xu, Landreth, Kalogerakis, Maji, et al. …

Entry Number: 161

Session/Category Title: Portraits & Speech
