“Synthetic defocus and look-ahead autofocus for casual videography” by Zhang, Matzen, Nguyen, Yao, Zhang, et al. …

  • ©Xuaner (Cecilia) Zhang, Kevin Matzen, Vivien Nguyen, Dillon Yao, You Zhang, and Ren Ng

Session Title:

    Image Science

Title:

    Synthetic defocus and look-ahead autofocus for casual videography

Abstract:


    In cinema, large camera lenses create beautiful shallow depth of field (DOF), but make focusing difficult and expensive. Accurate cinema focus usually relies on a script and a person to control focus in real time. Casual videographers often crave cinematic focus, but fail to achieve it. We either sacrifice shallow DOF, as in smartphone videos, or struggle to deliver accurate focus, as in videos from larger cameras. This paper takes a new approach to cinematic focus for casual videography. We present a system that synthetically renders refocusable video from a deep DOF video shot with a smartphone, and analyzes future video frames to deliver context-aware autofocus for the current frame. To create refocusable video, we extend recent machine learning methods designed for still photography, contributing a new training dataset, a rendering model better suited to cinema focus, and a filtering solution for temporal coherence. To choose focus accurately for each frame, we demonstrate autofocus that looks at upcoming video frames and applies AI-assist modules such as motion, face, audio, and saliency detection. We also show that autofocus benefits from machine learning and a large-scale video dataset with focus annotation, which we create efficiently using our RVR-LAAF GUI. We deliver, for example, a shallow DOF video where the autofocus transitions onto each person before she begins to speak. This is impossible for conventional camera autofocus because it would require seeing into the future.
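
    As a rough illustration of why depth and focus distance together determine synthetic defocus, the sketch below evaluates the standard thin-lens circle-of-confusion formula. It is not the rendering model from the paper; the function name, parameter names, and the example lens values are assumptions chosen for illustration only.

```python
def circle_of_confusion_mm(depth_mm, focus_mm, focal_length_mm, f_number):
    """Blur-circle diameter on the sensor under the thin-lens model.

    Illustrative only: a textbook formula, not the paper's renderer.
    All distances are in millimetres, measured from the lens.
    """
    if depth_mm <= 0:
        raise ValueError("depth must be positive")
    if focus_mm <= focal_length_mm:
        raise ValueError("focus distance must exceed the focal length")

    aperture_mm = focal_length_mm / f_number  # aperture diameter
    magnification = focal_length_mm / (focus_mm - focal_length_mm)
    # Blur grows with the relative distance between the object and the focal plane.
    return aperture_mm * magnification * abs(depth_mm - focus_mm) / depth_mm


# Hypothetical example: 50 mm lens at f/2 focused at 2 m, object at 4 m.
blur = circle_of_confusion_mm(4000.0, 2000.0, 50.0, 2.0)
in_focus = circle_of_confusion_mm(2000.0, 2000.0, 50.0, 2.0)
```

    An object on the focal plane yields zero blur, and blur increases as the object moves away from it, which is why a per-pixel depth estimate plus a chosen focus distance is enough to drive a synthetic defocus effect.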


