“Deep video portraits” by Kim, Garrido, Tewari, Xu, Thies, et al. …



Abstract:


    We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network – thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.
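The key idea in the abstract is the recombination of reconstructed face-model parameters: pose, expression, and gaze come from the source actor, while the target's identity and scene illumination are preserved, and the recombined parameters are rendered into the synthetic conditioning input for the trained network. A minimal sketch of that recombination step is shown below; the parameter names and the `FaceParams` container are illustrative assumptions, not the paper's actual notation or code.

```python
from dataclasses import dataclass, replace

# Hypothetical grouping of parametric-face-model coefficients, following the
# split described in the abstract (illustrative field names, not the paper's).
@dataclass(frozen=True)
class FaceParams:
    identity: tuple      # person-specific shape/reflectance -- kept from target
    illumination: tuple  # scene lighting -- kept from target
    rotation: tuple      # head rotation -- transferred from source
    translation: tuple   # 3D head position -- transferred from source
    expression: tuple    # expression coefficients -- transferred from source
    gaze: tuple          # eye gaze and blink state -- transferred from source

def reenact(source: FaceParams, target: FaceParams) -> FaceParams:
    """Recombine parameters per frame: the source drives pose, expression,
    and gaze, while the target's identity and illumination are preserved."""
    return replace(
        target,
        rotation=source.rotation,
        translation=source.translation,
        expression=source.expression,
        gaze=source.gaze,
    )
```

In the full pipeline, the recombined parameters for each frame would be rasterized into the synthetic conditioning renderings, which the trained rendering-to-video network then translates into photo-realistic output frames of the target actor.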


