“Neural volumes: learning dynamic renderable volumes from images” by Lombardi, Simon, Saragih, Schwartz, Lehrmann, and Sheikh

  • Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh

Title:

    Neural volumes: learning dynamic renderable volumes from images

Session/Category Title: Neural Rendering


Abstract:


    Modeling and rendering of dynamic scenes is challenging, as natural scenes often contain complex phenomena such as thin structures, evolving topology, translucency, scattering, occlusion, and biological motion. Mesh-based reconstruction and tracking often fail in these cases, and other approaches (e.g., light field video) typically rely on constrained viewing conditions, which limit interactivity. We circumvent these difficulties by presenting a learning-based approach to representing dynamic objects inspired by the integral projection model used in tomographic imaging. The approach is supervised directly from 2D images in a multi-view capture setting and does not require explicit reconstruction or tracking of the object. Our method has two primary components: an encoder-decoder network that transforms input images into a 3D volume representation, and a differentiable ray-marching operation that enables end-to-end training. By virtue of its 3D representation, our construction extrapolates better to novel viewpoints compared to screen-space rendering techniques. The encoder-decoder architecture learns a latent representation of a dynamic scene that enables us to produce novel content sequences not seen during training. To overcome memory limitations of voxel-based representations, we learn a dynamic irregular grid structure implemented with a warp field during ray-marching. This structure greatly improves the apparent resolution and reduces grid-like artifacts and jagged motion. Finally, we demonstrate how to incorporate surface-based representations into our volumetric-learning framework for applications where the highest resolution is required, using facial performance capture as a case in point.
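
    The abstract describes two technical pieces: an encoder-decoder that regresses an RGBα voxel volume, and a differentiable ray-marching step that accumulates color and opacity along camera rays, optionally sampling the volume through a learned warp field, so the whole pipeline can be trained from 2D images alone. The sketch below is not the authors' released code; it illustrates that ray-marching step in PyTorch with a standard emission-absorption accumulation. The function and argument names (march_rays, warp, n_steps) are illustrative, and the paper's exact accumulation rule and warp parameterization differ in their details.

    import torch
    import torch.nn.functional as F

    def march_rays(rgba, origins, directions, near, far, n_steps=128, warp=None):
        """Front-to-back accumulation of color and opacity along rays.

        rgba:       (1, 4, D, H, W) voxel grid; channels are RGB plus differential opacity.
        origins:    (R, 3) ray origins in the normalized volume frame [-1, 1]^3.
        directions: (R, 3) unit ray directions in the same frame.
        warp:       optional function mapping sample points (R, S, 3) to warped points,
                    standing in for the paper's warp field (identity when None).
        Returns per-ray colors (R, 3) and accumulated opacity (R,).
        """
        t = torch.linspace(near, far, n_steps, device=rgba.device)            # (S,)
        pts = origins[:, None, :] + t[None, :, None] * directions[:, None, :] # (R, S, 3)
        if warp is not None:
            pts = warp(pts)  # sample the template volume through the warp field
        # grid_sample expects coordinates ordered (x, y, z), i.e. matching (W, H, D), in [-1, 1].
        grid = pts[None, :, :, None, :]                                       # (1, R, S, 1, 3)
        samples = F.grid_sample(rgba, grid, align_corners=True)               # (1, 4, R, S, 1)
        samples = samples[0, :, :, :, 0].permute(1, 2, 0)                     # (R, S, 4)
        rgb, sigma = samples[..., :3], samples[..., 3].clamp(min=0.0)
        dt = (far - near) / n_steps                                           # step length
        alpha = 1.0 - torch.exp(-sigma * dt)                                  # per-step opacity
        trans = torch.cumprod(                                                # transmittance reaching each step
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
        weights = alpha * trans                                               # (R, S)
        color = (weights[..., None] * rgb).sum(dim=1)                         # (R, 3)
        return color, weights.sum(dim=1)

    Because every operation here is differentiable, a photometric loss between the marched colors and the captured images can be backpropagated through the marcher into whatever network produced the rgba volume and the warp, which is what enables the end-to-end training the abstract refers to.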

