“Synthesizing Obama: learning lip sync from audio” by Suwajanakorn, Seitz and Kemelmacher-Shlizerman

  • ©Supasorn Suwajanakorn, Steven Seitz, and Ira Kemelmacher-Shlizerman




    Synthesizing Obama: learning lip sync from audio

Session/Category Title: Speech and Facial Animation




    Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.


    1. Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).Google Scholar
    2. Robert Anderson, Björn Stenger, Vincent Wan, and Roberto Cipolla. 2013a. An expressive text-driven 3D talking head. In ACM SIGGRAPH 2013 Posters. ACM, 80. Google ScholarDigital Library
    3. Robert Anderson, Björn Stenger, Vincent Wan, and Roberto Cipolla. 2013b. Expressive visual text-to-speech using active appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3382–3389. Google ScholarDigital Library
    4. Fabrice Bellard, M Niedermayer, and others. 2012. FFmpeg. Availabel from: http://ffm.peg.org (2012).Google Scholar
    5. Floraine Berthouzoz, Wilmot Li, and Maneesh Agrawala. 2012. Tools for placing cuts and transitions in interview video. ACM Trans. Graph. 31, 4 (2012), 67–1. Google ScholarDigital Library
    6. G. Bradski. 2000. Dr. Dobb’s Journal of Software Tools (2000).Google Scholar
    7. Matthew Brand. 1999. Voice Puppetry. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’99). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 21–28. Google ScholarDigital Library
    8. Christoph Bregler, Michele Covell, and Malcolm Slaney. 1997. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 353–360. Google ScholarDigital Library
    9. Peter J Burt and Edward H Adelson. 1983. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics (TOG) 2, 4 (1983), 217–236. Google ScholarDigital Library
    10. Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics (TOG) 35, 4 (2016), 126.Google ScholarDigital Library
    11. Yong Cao, Wen C Tien, Petros Faloutsos, and Frédéric Pighin. 2005. Expressive speech-driven facial animation. ACM Transactions on Graphics (TOG) 24, 4 (2005), 1283–1302. Google ScholarDigital Library
    12. Timothy F Cootes, Gareth J Edwards, Christopher J Taylor, and others. 2001. Active appearance models. IEEE Transactions on pattern analysis and machine intelligence 23, 6(2001), 681–685.Google ScholarDigital Library
    13. Kevin Dale, Kalyan Sunkavalli, Micah K Johnson, Daniel Vlasic, Wojciech Matusik, and Hanspeter Pfister. 2011. Video face replacement. ACM Transactions on Graphics (TOG) 30, 6 (2011), 130.Google ScholarDigital Library
    14. Tony Ezzat, Gadi Geiger, and Tomaso Poggio. 2002. Trainable videorealistic speech animation. Vol. 21. ACM. Google ScholarDigital Library
    15. Bo Fan, Lijuan Wang, Frank K Soong, and Lei Xie. 2015a. Photo-real talking head with deep bidirectional LSTM. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4884–4888. Google ScholarCross Ref
    16. Bo Fan, Lei Xie, Shan Yang, Lijuan Wang, and Frank K Soong. 2015b. A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications (2015), 1–23.Google Scholar
    17. Shengli Fu, Ricardo Gutierrez-Osuna, Anna Esposito, Praveen K Kakumanu, and Oscar N Garcia. 2005. Audio/visual mapping with cross-modal hidden Markov models. IEEE Transactions on Multimedia 7, 2 (2005), 243–252.Google ScholarDigital Library
    18. Yarin Gal. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287 (2015).Google Scholar
    19. Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormahlen, Patrick Perez, and Christian Theobalt. 2014. Automatic face reenactment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4217–4224. Google ScholarDigital Library
    20. Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Perez, and Christian Theobalt. 2015. Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum, Vol. 34. Wiley Online Library, 193–204.Google Scholar
    21. Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013).Google Scholar
    22. Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 273–278. Google ScholarCross Ref
    23. Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5 (2005), 602–610. Google ScholarDigital Library
    24. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780. Google ScholarDigital Library
    25. Masahide Kawai, Tomoyori Iwao, Daisuke Mima, Akinobu Maejima, and Shigeo Morishima. 2014. Data-driven speech animation synthesis focusing on realistic inside of the mouth. Journal of information processing 22, 2 (2014), 401–409.Google ScholarCross Ref
    26. Ira Kemelmacher-Shlizerman and Steven M Seitz. 2012. Collection flow. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 1792–1799.Google ScholarCross Ref
    27. Taehwan Kim, Yisong Yue, Sarah Taylor, and Iain Matthews. 2015. A decision tree framework for spatiotemporal sequence prediction. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 577–586. Google ScholarDigital Library
    28. Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10 (2009), 1755–1758.Google ScholarDigital Library
    29. Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).Google Scholar
    30. Kai Li, Feng Xu, Jue Wang, Qionghai Dai, and Yebin Liu. 2012. A data-driven approach for facial expression synthesis in video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 57–64.Google ScholarDigital Library
    31. Shu Liang, Linda G Shapiro, and Ira Kemelmacher-Shlizerman. 2016. Head reconstruction from internet photos. In European Conference on Computer Vision. Springer, 360–374.Google ScholarCross Ref
    32. Ce Liu, William T Freeman, Edward H Adelson, and Yair Weiss. 2008. Human-assisted motion annotation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 1–8.Google ScholarCross Ref
    33. Wesley Mattheyses, Lukas Latacz, and Werner Verhelst. 2013. Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication 55, 7 (2013), 857–876. Google ScholarDigital Library
    34. Wesley Mattheyses and Werner Verhelst. 2015. Audiovisual speech synthesis: An overview of the state-of-the-art. Speech Communication 66 (2015), 182–217. Google ScholarDigital Library
    35. Aude Oliva, Antonio Torralba, and Philippe G. Schyns. 2006. Hybrid Images. ACM Trans. Graph. 25, 3 (July 2006), 527–532. Google ScholarDigital Library
    36. Wener Robitza. 2016. ffmpeg-normalize. https://github.com/slhck/ffmpeg-normalize. (2016).Google Scholar
    37. Shunsuke Saito, Tianye Li, and Hao Li. 2016. Real-Time Facial Segmentation and Performance Capture from RGB Input. arXiv preprint arXiv:1604.02647 (2016).Google Scholar
    38. Shinji Sako, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. 2000. HMM-based text-to-audio-visual speech synthesis.. In INTERSPEECH. 25–28.Google Scholar
    39. YiChang Shih, Sylvain Paris, Connelly Barnes, William T Freeman, and Frédo Durand. 2014. Style transfer for headshot portraits. (2014).Google Scholar
    40. Taiki Shimba, Ryuhei Sakurai, Hirotake Yamazoe, and Joo-Ho Lee. 2015. Talking heads synthesis from audio with deep neural networks. In 2015 IEEE/SICE International Symposium on System Integration (SII). IEEE, 100–105. Google ScholarCross Ref
    41. Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M Seitz. 2014. Total moving face reconstruction. In European Conference on Computer Vision. Springer, 796–812. Google ScholarCross Ref
    42. Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2015. What Makes Tom Hanks Look Like Tom Hanks. In Proceedings of the IEEE International Conference on Computer Vision. 3952–3960.Google ScholarDigital Library
    43. Sarah Taylor, Akihiro Kato, Ben Milner, and Iain Matthews. 2016. Audio-to-Visual Speech Conversion using Deep Neural Networks. (2016).Google Scholar
    44. Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. 2012. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH/Eurographics conference on Computer Animation. Eurographics Association, 275–284.Google Scholar
    45. Alexandru Telea. 2004. An image inpainting technique based on the fast marching method. Journal of graphics tools 9, 1 (2004), 23–34. Google ScholarCross Ref
    46. Justus Thies, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. 2015. Real-time expression transfer for facial reenactment. ACM Transactions on Graphics (TOG) 34, 6 (2015), 183.Google ScholarDigital Library
    47. Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2016).Google ScholarCross Ref
    48. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).Google Scholar
    49. Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popović. 2005. Face transfer with multilinear models. In ACM Transactions on Graphics (TOG), Vol. 24. ACM, 426–433. Google ScholarDigital Library
    50. Lijuan Wang, Wei Han, and Frank K Soong. 2012. High quality lip-sync animation for 3D photo-realistic talking head. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4529–4532. Google ScholarCross Ref
    51. Lijuan Wang, Xiaojun Qian, Wei Han, and Frank K Soong. 2010. Synthesizing photo-real talking head via trajectory-guided sample selection.. In INTERSPEECH, Vol. 10. 446–449.Google Scholar
    52. Lei Xie and Zhi-Qiang Liu. 2007a. A coupled HMM approach to video-realistic speech animation. Pattern Recognition 40, 8 (2007), 2325–2340. Google ScholarDigital Library
    53. Lei Xie and Zhi-Qiang Liu. 2007b. Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Transactions on Multimedia 9, 3 (2007), 500–510. Google ScholarDigital Library
    54. Xuehan Xiong and Fernando De la Torre. 2013. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition. 532–539. Google ScholarDigital Library
    55. Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).Google Scholar
    56. Xinjian Zhang, Lijuan Wang, Gang Li, Frank Seide, and Frank K Soong. 2013. A new language independent, photo-realistic talking head driven by voice only.. In INTERSPEECH. 2743–2747.Google Scholar

ACM Digital Library Publication:

Overview Page: