“LookinGood: enhancing performance capture with real-time neural re-rendering” – ACM SIGGRAPH HISTORY ARCHIVES

  • SIGGRAPH Asia 2018 Technical Papers: Martin-Brualla, “LookinGood: enhancing performance capture with real-time neural re-rendering”


Abstract:


    Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus on real-time performance capture of humans in motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture, such as holes and noise in the final rendering, poor lighting, and low-resolution textures. We take the novel approach of augmenting such real-time performance capture systems with a deep architecture that takes a rendering from an arbitrary viewpoint and jointly performs completion, super-resolution, and denoising of the imagery in real time. We call this approach neural (re-)rendering, and our live system “LookinGood”. Our deep architecture is trained to produce high-resolution, high-quality images from a coarse rendering in real time. First, we propose a self-supervised training method that does not require manual ground-truth annotation. We contribute a specialized reconstruction error that uses semantic information to focus on relevant parts of the subject, e.g. the face. We also introduce a saliency reweighting scheme for the loss function that is able to discard outliers. We specifically design the system for virtual and augmented reality headsets, where the consistency between the left and right eye plays a crucial role in the final user experience. Finally, we generate temporally stable results by explicitly minimizing the difference between two consecutive frames. We tested the proposed system in two different scenarios: the first involving a single RGB-D sensor and upper-body reconstruction of an actor, the second consisting of full-body 360° capture. Through extensive experimentation, we demonstrate how our system generalizes across unseen sequences and subjects.
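The abstract names four training signals: a reconstruction error that emphasizes semantic regions such as the face, a reweighting of the loss that discards outliers, left/right eye consistency, and a penalty on the difference between consecutive frames. The block below is only a rough sketch of how such terms might be combined, not the paper's formulation. It assumes flattened grayscale images with values in [0, 1], a precomputed face mask, and a right-eye output already warped into the left view. All function names, the outlier rule (residual above `k` times the mean residual), and the term weights are hypothetical.

```python
from typing import List, Optional, Sequence

Image = Sequence[float]  # assumption: a flattened grayscale image in [0, 1]


def mean_abs_diff(a: Image, b: Image,
                  weights: Optional[Sequence[float]] = None) -> float:
    """Weighted mean absolute difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    if weights is None:
        weights = [1.0] * len(a)
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # nothing selected, so nothing to penalize
    weighted = sum(w * abs(x - y) for w, x, y in zip(weights, a, b))
    return weighted / total_weight


def outlier_weights(pred: Image, target: Image, k: float = 3.0) -> List[float]:
    """Hypothetical outlier rule: weight 0 for pixels whose residual
    exceeds k times the mean residual, weight 1 otherwise."""
    residuals = [abs(x - y) for x, y in zip(pred, target)]
    if not residuals:
        return []
    mean_residual = sum(residuals) / len(residuals)
    return [0.0 if r > k * mean_residual else 1.0 for r in residuals]


def total_loss(pred_left: Image, pred_right_warped: Image,
               prev_pred_left: Image, target_left: Image,
               face_mask: Sequence[float],
               w_face: float = 1.0, w_stereo: float = 0.5,
               w_temporal: float = 0.5, k: float = 3.0) -> float:
    """Combine the four terms; the weights are illustrative placeholders."""
    keep = outlier_weights(pred_left, target_left, k)
    # Reconstruction over the whole image, ignoring outlier pixels.
    recon = mean_abs_diff(pred_left, target_left, keep)
    # Extra reconstruction term restricted to the semantic (face) region.
    face_weights = [m * kp for m, kp in zip(face_mask, keep)]
    face = mean_abs_diff(pred_left, target_left, face_weights)
    # Stereo term: left output vs. right output warped into the left view.
    stereo = mean_abs_diff(pred_left, pred_right_warped)
    # Temporal term: current output vs. previous output.
    temporal = mean_abs_diff(pred_left, prev_pred_left)
    return recon + w_face * face + w_stereo * stereo + w_temporal * temporal
```

In this sketch the outlier weights are computed from the current residuals and then treated as constants, so a pixel with an unusually large error contributes nothing to the reconstruction terms. In a real training loop the same idea would be applied to image tensors inside an automatic-differentiation framework.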


