“Stereo magnification: learning view synthesis using multiplane images” by Flynn, Zhou, Tucker, Fyffe and Snavely

  • ©John Flynn, Tinghui Zhou, Richard Tucker, Graham Fyffe, and Noah Snavely

Entry Number: 65

Title:

    Stereo magnification: learning view synthesis using multiplane images

Session/Category Title: Computational Photos and Videos


Abstract:


    The view synthesis problem—generating novel views of a scene from known imagery—has garnered recent attention due in part to compelling applications in virtual and augmented reality. In this paper, we explore an intriguing scenario for view synthesis: extrapolating views from imagery captured by narrow-baseline stereo cameras, including VR cameras and now-widespread dual-lens camera phones. We call this problem stereo magnification, and propose a learning framework that leverages a new layered representation that we call multiplane images (MPIs). Our method also uses a massive new data source for learning view extrapolation: online videos on YouTube. Using data mined from such videos, we train a deep network that predicts an MPI from an input stereo image pair. This inferred MPI can then be used to synthesize a range of novel views of the scene, including views that extrapolate significantly beyond the input baseline. We show that our method compares favorably with several recent view synthesis methods, and demonstrate applications in magnifying narrow-baseline stereo images.
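    The abstract does not spell out how an MPI is turned into an image. As a minimal sketch, assuming an MPI is a stack of RGBA planes at fixed depths that are blended back to front with the standard "over" compositing operator, the final blending step might look like the code below. The function name `composite_mpi` and the nested-list image layout are hypothetical choices made for illustration, and the step that warps each plane into the target camera before blending is omitted.

```python
from typing import List, Tuple

RGBA = Tuple[float, float, float, float]  # (r, g, b, alpha), each in [0, 1]
RGB = Tuple[float, float, float]
Layer = List[List[RGBA]]                  # one plane: rows of RGBA pixels


def composite_mpi(layers_back_to_front: List[Layer]) -> List[List[RGB]]:
    """Blend RGBA planes into one RGB image with the 'over' operator.

    Assumption (not taken from the abstract): planes are already aligned
    to the target view and are ordered from farthest to nearest.
    """
    height = len(layers_back_to_front[0])
    width = len(layers_back_to_front[0][0])
    # Start from a black background.
    out: List[List[RGB]] = [
        [(0.0, 0.0, 0.0) for _ in range(width)] for _ in range(height)
    ]
    for layer in layers_back_to_front:
        for y in range(height):
            for x in range(width):
                r, g, b, a = layer[y][x]
                pr, pg, pb = out[y][x]
                # Nearer plane covers what is behind it in proportion to alpha.
                out[y][x] = (
                    a * r + (1.0 - a) * pr,
                    a * g + (1.0 - a) * pg,
                    a * b + (1.0 - a) * pb,
                )
    return out


# Usage: a 1x1 image with an opaque red far plane and a
# half-transparent blue near plane.
far_plane = [[(1.0, 0.0, 0.0, 1.0)]]
near_plane = [[(0.0, 0.0, 1.0, 0.5)]]
image = composite_mpi([far_plane, near_plane])
```

    In a full renderer each plane would first be reprojected into the novel viewpoint, so this sketch covers only the per-pixel blending.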
