A deep learning approach for generalized speech animation

We introduce a simple and effective deep learning approach for automatically generating natural-looking speech animation synchronized to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. The approach has several attractive properties: it runs in real time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. An important focus of our work is an approach to speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign-language input. Our approach can also generate on-demand speech animation in real time from user speech input.
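
The sliding window idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give window sizes, features, or the network, so everything below is an assumption made for illustration. In particular, the window length, the `sil` padding label, the one-hot encoding, the averaging of overlapping predictions, and the untrained random linear map standing in for the learned nonlinear network are all hypothetical choices.

```python
import random

SILENCE = "sil"  # hypothetical padding label, not taken from the paper


def sliding_windows(frames, size):
    """Return every run of `size` consecutive frames, stepping one frame at a time."""
    return [frames[i:i + size] for i in range(len(frames) - size + 1)]


def one_hot_window(window, vocab):
    """Flatten a window of phoneme labels into a single one-hot feature vector."""
    vec = []
    for label in window:
        vec.extend(1.0 if label == p else 0.0 for p in vocab)
    return vec


def make_toy_predictor(n_in, n_out, seed=0):
    """Stand-in for a trained network: a fixed, untrained random linear map."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    def predict(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return predict


def animate(frames, vocab, size, n_params, predictor):
    """Predict one output window per input window, then average the overlaps."""
    pad = [SILENCE] * (size - 1)
    padded = pad + list(frames) + pad
    sums = [[0.0] * n_params for _ in padded]
    counts = [0] * len(padded)
    for start, window in enumerate(sliding_windows(padded, size)):
        flat = predictor(one_hot_window(window, vocab))
        for j in range(size):
            for k in range(n_params):
                sums[start + j][k] += flat[j * n_params + k]
            counts[start + j] += 1
    averaged = [[v / c for v in row] for row, c in zip(sums, counts)]
    # Drop the padding so there is one output row per input frame.
    return averaged[size - 1:size - 1 + len(frames)]


# Per-frame phoneme labels for a short utterance (illustrative values only).
vocab = [SILENCE, "hh", "eh", "l", "ow"]
frames = ["hh", "hh", "eh", "eh", "eh", "l", "l", "ow", "ow", "ow"]
size, n_params = 5, 3  # assumed window length and mouth-parameter count
predictor = make_toy_predictor(size * len(vocab), size * n_params)
curve = animate(frames, vocab, size, n_params, predictor)
print(len(curve), len(curve[0]))  # 10 3
```

Because each window spans several neighbouring phoneme labels, the prediction for any one frame depends on its context, which is one plausible way a predictor of this kind could capture coarticulation. Averaging the overlapping windows is a simple blending choice assumed here; it yields one vector of mouth parameters per input frame.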

SIGGRAPH 2017: Technical Papers

Session/Category Title: Speech and Facial Animation
