Neural Rendering and Reenactment of Human Actor Videos

We propose a method for generating video-realistic animations of real humans under user control. In contrast to conventional human character rendering, we do not require the availability of a production-quality photo-realistic three-dimensional (3D) model of the human but instead rely on a video sequence in conjunction with a (medium-quality) controllable 3D template model of the person. With that, our approach significantly reduces production cost compared to conventional rendering approaches based on production-quality 3D models and can also be used to realistically edit existing videos. Technically, this is achieved by training a neural network that translates simple synthetic images of a human character into realistic imagery. For training our networks, we first track the 3D motion of the person in the video using the template model and subsequently generate a synthetically rendered version of the video. These images are then used to train a conditional generative adversarial network that translates synthetic images of the 3D model into realistic imagery of the human. We evaluate our method for the reenactment of another person that is tracked to obtain the motion data, and show video results generated from artist-designed skeleton motion. Our results outperform the state of the art in learning-based human image synthesis.

References:

Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https://www.tensorflow.org/.
Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. 2005. SCAPE: Shape completion and animation of people. ACM Trans. Graph. 24, 3 (Jul. 2005), 408–416. DOI:https://doi.org/10.1145/1073204.1073207
Guha Balakrishnan, Amy Zhao, Adrian V. Dalca, Fredo Durand, and John Guttag. 2018. Synthesizing images of humans in unseen poses. In Proceedings of the Computer Vision and Pattern Recognition (CVPR’18).
Pascal Bérard, Derek Bradley, Maurizio Nitti, Thabo Beeler, and Markus Gross. 2014. High-quality capture of eyes. ACM Trans. Graph. 33, 6 (2014), 223:1–223:12.
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH’99). ACM Press/Addison-Wesley Publishing Co., New York, NY, 187–194.
Federica Bogo, Michael J. Black, Matthew Loper, and Javier Romero. 2015. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In Proceedings of the International Conference on Computer Vision (ICCV’15). 2300–2308.
Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. 2016. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision (ECCV’16).
Cedric Cagniart, Edmond Boyer, and Slobodan Ilic. 2010. Free-form mesh tracking: A patch-based approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’10). IEEE, 1339–1346.
Joel Carranza, Christian Theobalt, Marcus A. Magnor, and Hans-Peter Seidel. 2003. Free-viewpoint video of human actors. ACM Trans. Graph. 22, 3 (Jul. 2003).
Dan Casas, Marco Volino, John Collomosse, and Adrian Hilton. 2014. 4D video textures for interactive character appearance. Comput. Graph. Forum 33, 2 (May 2014), 371–380. DOI:https://doi.org/10.1111/cgf.12296
Qifeng Chen and Vladlen Koltun. 2017. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). 1520–1529. DOI:https://doi.org/10.1109/ICCV.2017.168
Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the Computer Vision and Pattern Recognition (CVPR’18).
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. 2015a. High-quality streamable free-viewpoint video. ACM Trans. Graph. 34, 4, Article 69 (Jul. 2015), 13 pages. DOI:https://doi.org/10.1145/2766945
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. 2015b. High-quality streamable free-viewpoint video. ACM Trans. Graph. 34, 4 (2015), 69.
Edilson De Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. 2008. Performance capture from sparse multi-view video. In ACM SIGGRAPH 2008 Papers (SIGGRAPH’08). ACM, 98:1–98:10.
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning. 1–16.
Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. 2017. Motion2Fusion: Real-time volumetric performance capture. ACM Trans. Graph. 36, 6, Article 246 (Nov. 2017), 16 pages.
Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. 2016. Fusion4D: Real-time performance capture of challenging scenes. ACM Trans. Graph. 35, 4, Article 114 (Jul. 2016), 13 pages. DOI:https://doi.org/10.1145/2897824.2925969
Ahmed Elhayek, Edilson de Aguiar, Arjun Jain, Jonathan Tompson, Leonid Pishchulin, Micha Andriluka, Chris Bregler, Bernt Schiele, and Christian Theobalt. 2015. Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 3810–3818.
Patrick Esser, Ekaterina Sutter, and Björn Ommer. 2018. A variational U-net for conditional appearance and shape generation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’18). 8857–8866. DOI:10.1109/CVPR.2018.00923
Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel. 2009. Motion capture using joint skeleton tracking and surface estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 1746–1753.
Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor S. Lempitsky. 2016. DeepWarp: Photorealistic image resynthesis for gaze manipulation. In Proceedings of the European Conference on Computer Vision (ECCV’16).
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14), Vol. 2. MIT Press, 2672–2680.
Thomas Helten, Meinard Muller, Hans-Peter Seidel, and Christian Theobalt. 2013. Real-time body tracking with one depth camera and inertial sensors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’13).
Geoffrey E. Hinton and Ruslan Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (Jul. 2006), 504–507. DOI:https://doi.org/10.1126/science.1127647
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 5967–5976. DOI:https://doi.org/10.1109/CVPR.2017.632
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), Vancouver, BC, Canada Conference Track Proceedings. https://openreview.net/forum?id=Hk99zCeAb.
H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, N. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. 2018. Deep video portraits. ACM Transactions on Graphics 37, 4 (2018), 163:1–163:14. DOI:10.1145/3197517.3201283
Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR’14), Banff, AB, Canada, Conference Track Proceeding. http://arxiv.org/abs/1312.6114
Christoph Lassner, Gerard Pons-Moll, and Peter V. Gehler. 2017. A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17).
Guannan Li, Yebin Liu, and Qionghai Dai. 2014. Free-viewpoint video relighting from multi-view sequence under general illumination. Mach. Vision Appl. 25, 7 (Oct. 2014), 1737–1746. DOI:https://doi.org/10.1007/s00138-013-0559-0
Kun Li, Jingyu Yang, Leijie Liu, Ronan Boulic, Yu-Kun Lai, Yebin Liu, Yubin Li, and Eray Molla. 2017b. SPA: Sparse photorealistic animation using a single RGB-D camera. IEEE Trans. Circ. Syst. Vid. Technol. 27, 4 (Apr. 2017), 771–783. DOI:https://doi.org/10.1109/TCSVT.2016.2556419
Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017a. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36, 6 (Nov. 2017), 194:1–194:17.
Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised image-to-image translation networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., 700–708. http://dl.acm.org/citation.cfm?id=3294771.3294838.
Yebin Liu, Carsten Stoll, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. 2011. Markerless motion capture of interacting characters using multi-view image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’11). IEEE, 1249–1256.
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A skinned multi-person linear model. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 34, 6 (Oct. 2015), 248:1–248:16.
Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. 2018. Disentangled person image generation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR’18).
Liqian Ma, Qianru Sun, Xu Jia, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose guided person image generation. NIPS.
Wojciech Matusik, Chris Buehler, Ramesh Raskar, Steven J. Gortler, and Leonard McMillan. 2000. Image-based visual hulls. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 369–374.
Dushyant Mehta, Helge Rhodin, Dan Casas, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. 2017a. Monocular 3D human pose estimation using transfer learning and improved CNN supervision. In Proceedings of the International Conference on 3D Vision (3DV’17).
Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. 2017b. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graph. (Proceedings of SIGGRAPH 2017) 36, 4 (2017), 14.
Mehdi Mirza and Simon Osindero. 2014. Conditional Generative Adversarial Nets. (2014). https://arxiv.org/abs/1411.1784 arXiv:1411.1784.
Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. 2017. GANerated hands for real-time 3D hand tracking from monocular RGB. CoRR abs/1712.01057 (2017).
Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. 2016. Conditional image generation with PixelCNN decoders. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., 4797–4805.
Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. 2017. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).
Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter Gehler, and Bernt Schiele. 2016. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).
Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR’16), San Juan, Puerto Rico, Conference Track Proceedings. http://arxiv.org/abs/1511.06434
Javier Romero, Dimitrios Tzionas, and Michael J. Black. 2017. Embodied hands: Modeling and capturing hands and bodies together. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 36, 6 (Nov. 2017), 245:1–245:17. http://doi.acm.org/10.1145/3130800.3130883
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI’15). 234–241. DOI:https://doi.org/10.1007/978-3-319-24574-4_28
Rómer Rosales and Stan Sclaroff. 2006. Combining generative and discriminative models in a framework for articulated pose estimation. Int. J. Comput. Vis. 67, 3 (2006), 251–276.
Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. 2011. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision 91, 2 (2011), 200–215. DOI:https://doi.org/10.1007/s11263-010-0380-4
Arno Schödl and Irfan A. Essa. 2002. Controlled animation of video sprites. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA’02). ACM, New York, NY, 121–127. DOI:https://doi.org/10.1145/545261.545281
Arno Schödl, Richard Szeliski, David H. Salesin, and Irfan Essa. 2000. Video textures. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH’00). ACM Press/Addison-Wesley Publishing Co., New York, NY, 489–498. DOI:https://doi.org/10.1145/344779.345012
A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. 2017. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 2242–2251. DOI:https://doi.org/10.1109/CVPR.2017.241
Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. 2018. Deformable GANs for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).
Jonathan Starck and Adrian Hilton. 2007. Surface capture for performance-based animation. IEEE Comput. Graph. Appl. 27, 3 (2007), 21–31.
Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. 2008. Articulated mesh animation from multi-view silhouettes. In ACM Transactions on Graphics, Vol. 27. ACM, 97.
Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popović, Szymon Rusinkiewicz, and Wojciech Matusik. 2009. Dynamic shape capture using multi-view photometric stereo. ACM Trans. Graph. 28, 5 (2009), 174.
Marco Volino, Dan Casas, John Collomosse, and Adrian Hilton. 2014. Optimal representation of multiple view video. In Proceedings of the British Machine Vision Conference. BMVA Press.
Ruizhe Wang, Lingyu Wei, Etienne Vouga, Qixing Huang, Duygu Ceylan, Gerard Medioni, and Hao Li. 2016. Capturing dynamic textured surfaces of moving targets. In Proceedings of the European Conference on Computer Vision (ECCV’16).
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Michael Waschbüsch, Stephan Würmlin, Daniel Cotting, Filip Sadlo, and Markus Gross. 2005. Scalable 3D video of dynamic scenes. Vis. Comput. 21, 8–10 (2005), 629–638.
Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional pose machines. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’16).
E. Wood, T. Baltrusaitis, L. P. Morency, P. Robinson, and A. Bulling. 2016. A 3D morphable eye region model for gaze estimation. In Proceedings of the European Conference on Computer Vision (ECCV’16).
Chenglei Wu, Derek Bradley, Pablo Garrido, Michael Zollhöfer, Christian Theobalt, Markus Gross, and Thabo Beeler. 2016. Model-based teeth reconstruction. ACM Trans. Graph. 35, 6, Article 220 (2016), 220:1–220:13 pages.
Chenglei Wu, Carsten Stoll, Levi Valgaerts, and Christian Theobalt. 2013. On-set performance capture of multiple actors with a stereo camera. In ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013), Vol. 32. 161:1–161:11. DOI:https://doi.org/10.1145/2508363.2508418
Feng Xu, Yebin Liu, Carsten Stoll, James Tompkin, Gaurav Bharaj, Qionghai Dai, Hans-Peter Seidel, Jan Kautz, and Christian Theobalt. 2011. Video-based characters: Creating new human performances from a multi-view video database. In ACM SIGGRAPH 2011 Papers (SIGGRAPH’11). ACM, New York, NY, Article 32, 10 pages. DOI:https://doi.org/10.1145/1964921.1964927
Weipeng Xu, Avishek Chatterjee, Michael Zollhoefer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. 2018. MonoPerfCap: Human performance capture from monocular video. ACM Transactions on Graphics 37, 2 (2018), 27:1–27:15. DOI:10.1145/3181973
Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. IEEE International Conference on Computer Vision (ICCV’17) 2868–2876. DOI:https://doi.org/10.1109/ICCV.2017.310
T. Yu, K. Guo, F. Xu, Y. Dong, Z. Su, J. Zhao, J. Li, Q. Dai, and Y. Liu. 2017. BodyFusion: Real-time capture of human motion and surface geometry using a single depth camera. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). ACM, 910–919. DOI:https://doi.org/10.1109/ICCV.2017.104
Qing Zhang, Bo Fu, Mao Ye, and Ruigang Yang. 2014. Quality dynamic human body modeling using a single low-cost depth camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14). IEEE, 676–683.
Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, and Yichen Wei. 2016. Deep kinematic pose regression. ECCV Workshops.
Hao Zhu, Hao Su, Peng Wang, Xun Cao, and Ruigang Yang. 2018. View extrapolation of human body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV’17). 2242–2251. DOI:https://doi.org/10.1109/ICCV.2017.244
C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. 2004. High-quality video view interpolation using a layered representation. In ACM Transactions on Graphics (TOG), Vol. 23. ACM, 600–608.

ACM Digital Library Publication:

Overview Page:

SIGGRAPH 2019: Technical Papers

“Neural Rendering and Reenactment of Human Actor Videos” by Liu, Xu, Zollhoefer, Kim, Bernard, et al. …

Conference:

Type(s):

Title:

Session/Category Title: Neural Rendering

Presenter(s)/Author(s):

Abstract:

References:

ACM Digital Library Publication:

Overview Page:

Sponsored by: