“Real-time 3D neural facial animation from binocular video” by Cao, Agrawal, De La Torre, Chen, Saragih, et al. …

  • ©Chen Cao, Vasu Agrawal, Francisco De La Torre, Lele Chen, Jason Saragih, Tomas Simon, and Yaser Sheikh




    Real-time 3D neural facial animation from binocular video



    We present a method for real-time facial animation of a 3D avatar from binocular video. Existing facial animation methods fail to automatically capture the precise and subtle facial motions needed to drive a photo-realistic 3D avatar “in the wild” (i.e., under variable illumination and camera noise). The novelty of our approach lies in a lightweight process for specializing a personalized face model to new environments, which enables extremely accurate real-time face tracking anywhere. Our method uses a pre-trained, high-fidelity personalized model of the face, which we complement with a novel illumination model to account for variations due to lighting and other factors often encountered in the wild (e.g., facial hair growth, makeup, skin blemishes). Our approach comprises two steps. First, we solve for our illumination model’s parameters by applying analysis-by-synthesis to a short video recording. Using the resulting pairs of model parameters (rigid, non-rigid) and the original images, we learn a regression for real-time inference from image space to the 3D shape and texture of the avatar. Second, given a new video, we fine-tune the real-time regression model with a few-shot learning strategy to adapt it to the new environment. We demonstrate our system’s ability to precisely capture subtle facial motions in unconstrained scenarios, in comparison to competing methods, on a diverse collection of identities, expressions, and real-world environments.
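As a rough illustration of the analysis-by-synthesis idea in the first step, the sketch below fits a toy illumination model to observed pixels by repeatedly rendering and comparing. It is not the paper's model: the per-pixel gain/bias form, the function names `render` and `fit_illumination`, and the step size are all assumptions chosen to keep the example small.

```python
from typing import List, Tuple


def render(albedo: List[float], gain: float, bias: float) -> List[float]:
    """Toy synthesis step: apply a hypothetical gain/bias illumination model."""
    return [gain * a + bias for a in albedo]


def fit_illumination(
    albedo: List[float],
    observed: List[float],
    steps: int = 1000,
    lr: float = 0.5,
) -> Tuple[float, float]:
    """Toy analysis step: gradient descent on the mean squared photometric error.

    Each iteration renders with the current (gain, bias), compares the result
    with the observed pixels, and moves the parameters to reduce the residual.
    """
    gain, bias = 1.0, 0.0  # start from "no illumination change"
    n = len(albedo)
    for _ in range(steps):
        synth = render(albedo, gain, bias)
        residual = [s - o for s, o in zip(synth, observed)]
        # Gradients of mean((gain * a + bias - o)^2) w.r.t. gain and bias.
        grad_gain = 2.0 * sum(r * a for r, a in zip(residual, albedo)) / n
        grad_bias = 2.0 * sum(residual) / n
        gain -= lr * grad_gain
        bias -= lr * grad_bias
    return gain, bias


# Hypothetical data: a known texture observed under an unknown lighting change.
albedo = [0.1, 0.4, 0.5, 0.8, 0.9]
observed = render(albedo, gain=1.3, bias=-0.05)
est_gain, est_bias = fit_illumination(albedo, observed)
print(round(est_gain, 4), round(est_bias, 4))
```

In a real system the synthesis step would be a full avatar renderer and the fitted parameters would be far richer. The loop structure (render, compare, update) is the only part this sketch is meant to convey.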


