iMapper: interaction-guided scene mapping from monocular videos

Next-generation smart and augmented reality systems demand a computational understanding of monocular footage that captures humans in physical spaces, in order to reveal plausible object arrangements and human-object interactions. Despite recent advances in both scene layout and human motion analysis, this setting remains challenging to analyze because objects and moving humans frequently occlude one another. We observe that object arrangements and human actions are often strongly correlated, and that this correlation can be used to help recover from these occlusions. We present iMapper, a data-driven method that identifies such human-object interactions and uses them to infer the layouts of occluded objects. Starting from a monocular video with detected 2D human joint positions that are potentially noisy and occluded, we first introduce the notion of interaction-saliency: space-time snapshots where informative human-object interactions happen. We then propose a global optimization that retrieves interactions from a database and fits them to the detected salient interactions in order to best explain the input video. We extensively evaluate the approach, both quantitatively against manually annotated ground truth and through a user study, and demonstrate that iMapper produces plausible scene layouts for scenes with medium to heavy occlusion. Code and data are available on the project page.
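The retrieve-and-fit idea in the abstract can be illustrated with a deliberately small sketch. This is not the authors' implementation: the paper describes a global optimization over space-time snapshots, whereas the toy below scores a single frame. It assumes a hypothetical database of 3D pose snippets with an interaction label, an orthographic camera, and a per-joint detection confidence in [0, 1] with at least one non-zero entry. All function names, field names, and data are made up for illustration.

```python
import numpy as np


def fit_scale_translation(model_2d, detected_2d, conf):
    """Confidence-weighted least-squares fit of detected ~ scale * model + shift."""
    w = conf / conf.sum()                      # normalized per-joint weights
    mu_m = w @ model_2d                        # weighted centroid of the model joints
    mu_d = w @ detected_2d                     # weighted centroid of the detections
    mc = model_2d - mu_m
    dc = detected_2d - mu_d
    scale = np.sum(w[:, None] * mc * dc) / np.sum(w[:, None] * mc * mc)
    shift = mu_d - scale * mu_m
    residual = np.sum(w[:, None] * (scale * mc - dc) ** 2)
    return scale, shift, residual


def retrieve_best_interaction(database, detected_2d, conf):
    """Return the database entry whose projected pose best explains the detections."""
    best = None
    for entry in database:
        model_2d = entry["pose_3d"][:, :2]     # assumed orthographic camera: drop depth
        scale, shift, residual = fit_scale_translation(model_2d, detected_2d, conf)
        if best is None or residual < best["residual"]:
            best = {"label": entry["label"], "scale": scale,
                    "shift": shift, "residual": residual}
    return best


# Hypothetical four-joint poses; real skeletons have many more joints.
database = [
    {"label": "sit on chair",
     "pose_3d": np.array([[0, 0, 0], [0, 1, 0], [1, 1, 0.5], [1, 0, 0.5]], float)},
    {"label": "stand",
     "pose_3d": np.array([[0, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0]], float)},
]

# Synthetic detections: the sitting pose, scaled and shifted, with the last
# joint occluded (confidence 0, so its coordinates are ignored by the fit).
detected_2d = 2.0 * database[0]["pose_3d"][:, :2] + np.array([1.0, -1.0])
detected_2d[3] = [50.0, 50.0]
conf = np.array([1.0, 1.0, 1.0, 0.0])

best = retrieve_best_interaction(database, detected_2d, conf)
print(best["label"], best["scale"], best["shift"])
```

Weighting by detection confidence is what lets the occluded joint drop out of the fit. In this sketch, the retrieved entry would then supply the occluded joint position and the associated object from the database.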

SIGGRAPH 2019: Technical Papers

“iMapper: interaction-guided scene mapping from monocular videos” by Monszpart, Guerrero, Ceylan, Yumer and Mitra

Session/Category Title: Off the Deep End