“LookinGood: enhancing performance capture with real-time neural re-rendering”
Conference:
Type(s):
Title:
- LookinGood: enhancing performance capture with real-time neural re-rendering
Session/Category Title: Capturing 4D performances
Presenter(s)/Author(s):
- Ricardo Martin-Brualla
- Rohit Pandey
- Shuoran Yang
- Pavel Pidlypenskyi
- Jonathan Taylor
- Julien Valentin
- Sameh Khamis
- Philip L. Davidson
- Anastasia Tkach
- Peter Lincoln
- Adarsh Kowdle
- Christoph Rhemann
- Daniel (Dan) B. Goldman
- Cem Keskin
- Steven M. Seitz
- Shahram Izadi
- Sean Ryan Fanello
Moderator(s):
Abstract:
Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus in real-time performance capture of humans under motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture such as holes and noise in the final rendering, poor lighting, and low-resolution textures. We take the novel approach to augment such real-time performance capture systems with a deep architecture that takes a rendering from an arbitrary viewpoint, and jointly performs completion, super resolution, and denoising of the imagery in real-time. We call this approach neural (re-)rendering, and our live system “LookinGood”. Our deep architecture is trained to produce high resolution and high quality images from a coarse rendering in real-time. First, we propose a self-supervised training method that does not require manual ground-truth annotation. We contribute a specialized reconstruction error that uses semantic information to focus on relevant parts of the subject, e.g. the face. We also introduce a salient reweighing scheme of the loss function that is able to discard outliers. We specifically design the system for virtual and augmented reality headsets where the consistency between the left and right eye plays a crucial role in the final user experience. Finally, we generate temporally stable results by explicitly minimizing the difference between two consecutive frames. We tested the proposed system in two different scenarios: one involving a single RGB-D sensor, and upper body reconstruction of an actor, the second consisting of full body 360° capture. Through extensive experimentation, we demonstrate how our system generalizes across unseen sequences and subjects.
References:
1. Robert Anderson, David Gallup, Jonathan T Barron, Janne Kontkanen, Noah Snavely, Carlos Hernández, Sameer Agarwal, and Steven M Seitz. 2016. Jump: virtual reality video. ACM Transactions on Graphics (TOG) (2016). Google ScholarDigital Library
2. Michael Bleyer, Christoph Rhemann, and Carsten Rother. 2011. PatchMatch Stereo-Stereo Matching with Slanted Support Windows.. In Bmvc, Vol. 11. 1–11.Google Scholar
3. Joel Carranza, Christian Theobalt, Marcus A. Magnor, and Hans-Peter Seidel. 2003. Free-viewpoint Video of Human Actors (SIGGRAPH ’03). Google ScholarDigital Library
4. Dan Casas, Marco Volino, John Collomosse, and Adrian Hilton. 2014. 4D Video Textures for Interactive Character Appearance. EUROGRAPHICS (2014).Google Scholar
5. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. CoRR abs/1802.02611 (2018).Google Scholar
6. Qifeng Chen and Vladlen Koltun. 2017. Photographic Image Synthesis with Cascaded Refinement Networks. ICCV (2017).Google Scholar
7. Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. 2015. High-quality Streamable Free-viewpoint Video. ACM TOG (2015). Google ScholarDigital Library
8. Brian Curless and Marc Levoy. 1996. A Volumetric Method for Building Complex Models from Range Images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. Google ScholarDigital Library
9. Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. 2017. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).Google ScholarCross Ref
10. D. Dai, R. Timofte, and L. Van Gool. 2015. Jointly Optimized Regressors for Image Super resolution. Computer Graphics Forum (2015). Google ScholarDigital Library
11. Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar. 2000. Acquiring the Reflectance Field of a Human Face. In SIGGRAPH. Google ScholarDigital Library
12. Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. 1996. Modeling and Rendering Architecture from Photographs: A Hybrid Geometry and Image-based Approach. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. Google ScholarDigital Library
13. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.Google Scholar
14. A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox. 2015. Learning to Generate Chairs with Convolutional Networks. CVPR (2015).Google Scholar
15. Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. 2017. Motion2Fusion: Real-time Volumetric Performance Capture. SIGGRAPH Asia (2017). Google ScholarDigital Library
16. Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. 2016. Fusion4D: Real-time Performance Capture of Challenging Scenes. SIGGRAPH (2016). Google ScholarDigital Library
17. Ruofei Du, Ming Chuang, Wayne Chang, Hugues Hoppe, and Amitabh Varshney. 2018. Montage4D: Interactive Seamless Fusion of Multiview Video Textures. In Proceedings of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D). Google ScholarDigital Library
18. M. Eisemann, B. De Decker, M. Magnor, P. Bekaert, E. De Aguiar, N. Ahmed, C. Theobalt, and A. Sellent. 2008. Floating Textures. Computer Graphics Forum (2008).Google Scholar
19. Daniel Evangelakos and Michael Mara. 2016. Extended TimeWarp latency compensation for virtual reality. Interactive 3D Graphics and Games (2016). Google ScholarDigital Library
20. S. R. Fanello, C. Keskin, P. Kohli, S. Izadi, J. Shotton, A. Criminisi, U. Pattacini, and T. Paek. 2014. Filter Forests for Learning Data-Dependent Convolutional Kernels. In CVPR. Google ScholarDigital Library
21. S. R. Fanello, C. Rhemann, V. Tankovich, A. Kowdle, S. Orts Escolano, D. Kim, and S. Izadi. 2016. HyperDepth: Learning Depth from Structured Light Without Matching. In CVPR.Google Scholar
22. Sean Ryan Fanello, Julien Valentin, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, Carlo Ciliberto, Philip Davidson, and Shahram Izadi. 2017a. Low Compute and Fully Parallel Computer Vision with HashMatch. In ICCV.Google Scholar
23. Sean Ryan Fanello, Julien Valentin, Christoph Rhemann, Adarsh Kowdle, Vladimir Tankovich, Philip Davidson, and Shahram Izadi. 2017b. UltraStereo: Efficient Learning-based Matching for Active Stereo Systems. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 6535–6544.Google ScholarCross Ref
24. J. Flynn, I. Neulander, J. Philbin, and N. Snavely. 2016. Deep Stereo: Learning to Predict New Views from the World’s Imagery. In CVPR.Google Scholar
25. G. Fyffe and P. Debevec. 2015. Single-Shot Reflectance Measurement from Polarized Color Gradient Illumination. In IEEE International Conference on Computational Photography.Google Scholar
26. Gene H. Golub, Per Christian Hansen, and Dianne P. O’Leary. 1999. Tikhonov Regularization and Total Least Squares. SIAM (1999).Google Scholar
27. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS. Google ScholarDigital Library
28. Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. 1996. The Lumigraph. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’96). Google ScholarDigital Library
29. X. Han, Z. Li, H. Huang, E. Kalogerakis, and Y. Yu. 2017. High-Resolution Shape Completion Using Deep Neural Networks for Global Structure and Local Geometry Inference. In IEEE International Conference on Computer Vision (ICCV).Google Scholar
30. Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc Stamminger. 2016. VolumeDeform: Real-time Volumetric Non-rigid Reconstruction. In Proceedings of European Conference on Computer Vision (ECCV).Google ScholarCross Ref
31. Intel. 2016. freeD technology.Google Scholar
32. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2016. Image-to-Image Translation with Conditional Adversarial Networks. arxiv (2016).Google Scholar
33. Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. 2015. Spatial Transformer Networks. In NIPS. Google ScholarDigital Library
34. L. C. Jain and L. R. Medsker. 1999. Recurrent Neural Networks: Design and Applications. CRC Press. Google ScholarDigital Library
35. Jeremy Jancsary, Sebastian Nowozin, and Carsten Rother. 2012. Loss-specific Training of Non-parametric Image Restoration Models: A New State of the Art. In ECCV. Google ScholarDigital Library
36. Dinghuang Ji, Junghyun Kwon, Max McFarland, and Silvio Savarese. 2017. Deep View Morphing. CoRR (2017).Google Scholar
37. Justin Johnson, Alexandre Alahi, and Fei-Fei Li. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. CoRR (2016).Google Scholar
38. Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2018. Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies. CVPR (2018).Google Scholar
39. Michael Kazhdan and Hugues Hoppe. 2013. Screened Poisson Surface Reconstruction. ACM Trans. Graph. 32, 3, Article 29 (July 2013), 13 pages. Google ScholarDigital Library
40. Sameh Khamis, Sean Ryan Fanello, Christoph Rhemann, Julien Valentin, Adarsh Kowdle, and Shahram Izadi. 2018. StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction. ECCV (2018).Google Scholar
41. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR (2014).Google Scholar
42. Adarsh Kowdle, Christoph Rhemann, Sean Fanello, Andrea Tagliasacchi, Jon Taylor, Philip Davidson, Mingsong Dou, Kaiwen Guo, Cem Keskin, Sameh Khamis, David Kim, Danhang Tang, Vladimir Tankovich, Julien Valentin, and Shahram Izadi. 2018. The Need 4 Speed in Real-Time Dense Visual Tracking. ACM SIGGRAPH ASIA and Transaction On Graphics (2018). Google ScholarDigital Library
43. Philipp Krähenbühl and Vladlen Koltun. 2011. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In NIPS. Google ScholarDigital Library
44. Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Joshua B. Tenenbaum. 2015. Deep Convolutional Inverse Graphics Network. In NIPS. Google ScholarDigital Library
45. V. Lempitsky and D. Ivanov. 2007. Seamless Mosaicing of Image-Based Texture Maps. In CVPR.Google Scholar
46. Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. CoRR (2017).Google Scholar
47. R. A. Newcombe, D. Fox, and S. M. Seitz. 2015. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In CVPR.Google Scholar
48. Harris Nover, Supreeth Achar, and Dan B Goldman. 2018. ESPReSSo: Efficient Slanted PatchMatch for Real-time Spacetime Stereo. 3DV (2018).Google Scholar
49. Augustus Odena, Vincent Dumoulin, and Chris Olah. 2016. Deconvolution and Checkerboard Artifacts. Distill (2016).Google Scholar
50. Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L. Davidson, Sameh Khamis, Mingsong Dou, Vladimir Tankovich, Charles Loop, Qin Cai, Philip A. Chou, Sarah Mennicken, Julien Valentin, Vivek Pradeep, Shenlong Wang, Sing Bing Kang, Pushmeet Kohli, Yuliya Lutchyn, Cem Keskin, and Shahram Izadi. 2016. Holoportation: Virtual 3D Teleportation in Real-time. In UIST. Google ScholarDigital Library
51. E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg. 2017. Transformation-Grounded Image Generation Network for Novel 3D View Synthesis. In CVPR.Google Scholar
52. Fabián Prada, Misha Kazhdan, Ming Chuang, Alvaro Collet, and Hugues Hoppe. 2017. Spatiotemporal Atlas Parameterization for Evolving Meshes. ACM TOG. (2017). Google ScholarDigital Library
53. Roveri Riccardo, Oztireli A. Cengiz, Pandele Ioana, and Gross Markus. 2018. Point-ProNets: Consolidation of Point Clouds with Convolutional Neural Networks. Computer Graphics Forum (2018).Google Scholar
54. Christian Richardt, Yael Pritch, Henning Zimmer, and Alexander Sorkine-Hornung. 2013. Megastereo: Constructing High-Resolution Stereo Panoramas. In Conference on Computer Vision and Pattern Recognition (CVPR). Google ScholarDigital Library
55. Gernot Riegler, RenÃl’ Ranftl, Matthias RÃijther, Thomas Pock, and Horst Bischof. 2015. Depth Restoration via Joint Training of a Global Regression Model and CNNs. In BMVC.Google Scholar
56. Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. 2017. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarCross Ref
57. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015).Google Scholar
58. S. Schulter, C. Leistner, and H. Bischof. 2015. Fast and accurate image upscaling with super-resolution forests. In CVPR.Google Scholar
59. Qi Shan, Riley Adams, Brian Curless, Yasutaka Furukawa, and Steven M. Seitz. 2013. The Visual Turing Test for Scene Reconstruction (3DV). Google ScholarDigital Library
60. Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) (2017). Google ScholarDigital Library
61. Vladimir Tankovich, Michael Schoenberg, Sean Ryan Fanello, Adarsh Kowdle, Christoph Rhemann, Max Dzitsiuk, Mirko Schmidt, Julien Valentin, and Shahram Izadi. 2018. SOS: Stereo Matching in O(1) with Slanted Support Windows. IROS (2018).Google Scholar
62. Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. 2016. Multi-view 3d models from single images with a convolutional network. ECCV (2016).Google Scholar
63. Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Google ScholarDigital Library
64. JMP Van Waveren. 2016. The asynchronous time warp for virtual reality on consumer hardware. VRST (2016). Google ScholarDigital Library
65. Marco Volino, Dan Casas, John Collomosse, and Adrian Hilton. 2014. Optimal Representation of Multiple View Video. In BMVC.Google Scholar
66. Shenlong Wang, Sean Ryan Fanello, Christoph Rhemann, Shahram Izadi, and Pushmeet Kohli. 2016. The Global Patch Collider. CVPR (2016).Google Scholar
67. Jimei Yang, Scott Reed, Ming-Hsuan Yang, and Honglak Lee. 2015. Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis. In NIPS. Google ScholarDigital Library
68. Tao Yu, Kaiwen Guo, Feng Xu, Yuan Dong, Zhaoqi Su, Jianhui Zhao, Jianguo Li, Qionghai Dai, and Yebin Liu. 2017. BodyFusion: Real-time Capture of Human Motion and Surface Geometry Using a Single Depth Camera. In The IEEE International Conference on Computer Vision (ICCV). ACM.Google ScholarCross Ref
69. Yinda Zhang, Sameh Khamis, Christoph Rhemann, Julien Valentin, Adarsh Kowdle, Vladimir Tankovich, Michael Schoenberg, Shahram Izadi, Thomas Funkhouser, and Sean Fanello. 2018. ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems. ECCV (2018).Google Scholar
70. Kun Zhou, Xi Wang, Yiying Tong, Mathieu Desbrun, Baining Guo, and Heung-Yeung Shum. 2005. TextureMontage. ACM TOG (2005). Google ScholarDigital Library
71. Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. 2016. View Synthesis by Appearance Flow. CoRR (2016).Google Scholar
72. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV.Google Scholar
73. C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. 2004. High-quality Video View Interpolation Using a Layered Representation. ACM TOG (2004). Google ScholarDigital Library
74. Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, and Marc Stamminger. 2014. Real-time Non-rigid Reconstruction using an RGB-D Camera. ACM Transactions on Graphics (TOG) (2014). Google ScholarDigital Library


