“A deep learning approach for generalized speech animation”

  • ©Sarah Taylor, Taehwan Kim, Moshe Mahler, James Krahe, Jessica K. Hodgins, Yisong Yue, and Iain Matthews






Session Title: Speech and Facial Animation



    We introduce a simple and effective deep learning approach to automatically generate natural-looking speech animation that synchronizes to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. One important focus of our work is to develop an effective approach for speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign language input. Our approach can also generate on-demand speech animation in real time from user speech input.
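
    As a rough illustration of the sliding window idea in the abstract, the Python sketch below slides a fixed-size window over a per-frame phoneme label sequence, asks a predictor for one output value per frame in each window, and averages the overlapping outputs. Everything specific here is an assumption made for illustration: the window size, the `sil` padding label, the `OPENNESS` table, the single scalar "mouth openness" output, and `toy_predictor`, which stands in for a trained network. Averaging the overlaps is one plausible blending rule, not something this record states.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical per-phoneme mouth-openness targets (illustrative values only).
OPENNESS = {"sil": 0.0, "m": 0.0, "aa": 1.0}


def toy_predictor(window: Sequence[str]) -> List[float]:
    """Stand-in for a trained network: returns one value per frame in the window."""
    level = mean([OPENNESS.get(p, 0.5) for p in window])
    return [level] * len(window)


def predict_sequence(
    labels: Sequence[str],
    predictor: Callable[[Sequence[str]], List[float]],
    size: int = 5,
    pad: str = "sil",
) -> List[float]:
    """Slide a window over `labels` and average the overlapping predictions."""
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    half = size // 2
    # Pad both ends so that every frame has a window centered on it.
    padded = [pad] * half + list(labels) + [pad] * half
    n = len(labels)
    sums = [0.0] * n
    counts = [0] * n
    for center in range(n):
        window = padded[center:center + size]
        for offset, value in enumerate(predictor(window)):
            frame = center + offset - half
            if 0 <= frame < n:  # ignore predictions that fall on padding
                sums[frame] += value
                counts[frame] += 1
    return [s / c for s, c in zip(sums, counts)]


if __name__ == "__main__":
    # One phoneme label per animation frame (assumed input format).
    print(predict_sequence(["m", "aa", "aa", "m"], toy_predictor, size=3))
```

    Because each frame is covered by several windows, its final value depends on the neighboring phonemes, which is the kind of context dependence that coarticulation requires. A real system would replace `toy_predictor` with a learned model and predict a vector of face parameters per frame rather than a single number.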


