“Language-driven synthesis of 3D scenes from scene databases”
Title:
- Language-driven synthesis of 3D scenes from scene databases
Session/Category Title: Learning to compose & decompose
Presenter(s)/Author(s):
- Rui Ma
- Akshay Gadi Patil
- Matthew Fisher
- Manyi Li
- Sören Pirk
- Binh-Son Hua
- Sai-Kit Yeung
- Xin Tong
- Leonidas (Leo) J. Guibas
- Hao Zhang
Abstract:
We introduce a novel framework for using natural language to generate and edit 3D indoor scenes, harnessing scene semantics and text-scene grounding knowledge learned from large annotated 3D scene databases. The advantage of natural language editing interfaces is strongest when performing semantic operations at the sub-scene level, acting on groups of objects. We learn how to manipulate these sub-scenes by analyzing existing 3D scenes. We perform edits by first parsing a natural language command from the user and transforming it into a semantic scene graph that is used to retrieve corresponding sub-scenes from the databases that match the command. We then augment this retrieved sub-scene by incorporating other objects that may be implied by the scene context. Finally, a new 3D scene is synthesized by aligning the augmented sub-scene with the user’s current scene, where new objects are spliced into the environment, possibly triggering appropriate adjustments to the existing scene arrangement. A suggestive modeling interface with multiple interpretations of user commands is used to alleviate ambiguities in natural language. We conduct studies comparing our approach against both prior text-to-scene work and artist-made scenes and find that our method significantly outperforms prior work and is comparable to handmade scenes even when complex and varied natural sentences are used.
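To make the pipeline in the abstract concrete, here is a minimal sketch of its first steps: parsing a command into a semantic scene graph, retrieving matching sub-scenes, and listing objects implied by the retrieved context. It is an illustration under stated assumptions, not the authors' implementation. The keyword vocabulary, the names `SceneGraph`, `Edge`, `parse_command`, `retrieve`, and `implied_objects`, the toy `DATABASE`, and the overlap-based score are all hypothetical. The paper's system learns its grounding from annotated 3D scene databases rather than using hand-written rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical hand-written vocabulary (assumption for illustration only).
OBJECTS = {"desk", "chair", "monitor", "keyboard", "lamp", "bed", "nightstand"}
RELATIONS = {"on": "on", "next to": "next_to", "in front of": "in_front_of"}


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    target: str


@dataclass
class SceneGraph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def parse_command(command: str) -> SceneGraph:
    """Toy parser: link the nearest object before and after each relation phrase."""
    text = " ".join(re.findall(r"[a-z]+", command.lower()))
    # Every known object mention, sorted by character offset.
    mentions = sorted(
        (m.start(), obj)
        for obj in OBJECTS
        for m in re.finditer(rf"\b{obj}\b", text)
    )
    graph = SceneGraph(nodes={obj for _, obj in mentions})
    for phrase, label in RELATIONS.items():
        for m in re.finditer(rf"\b{phrase}\b", text):
            before = [obj for pos, obj in mentions if pos < m.start()]
            after = [obj for pos, obj in mentions if pos > m.start()]
            if before and after:
                graph.edges.add(Edge(before[-1], label, after[0]))
    return graph


def retrieve(query: SceneGraph, database: dict, top_k: int = 2) -> list:
    """Rank sub-scenes by shared edges, then shared nodes (a stand-in score)."""
    def score(sub: SceneGraph):
        return (len(query.edges & sub.edges), len(query.nodes & sub.nodes))

    ranked = sorted(database.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]


def implied_objects(query: SceneGraph, sub: SceneGraph) -> set:
    """Objects present in the retrieved sub-scene but not named in the command."""
    return sub.nodes - query.nodes


# Hypothetical toy database of annotated sub-scenes.
DATABASE = {
    "office_corner": SceneGraph(
        nodes={"desk", "chair", "monitor", "keyboard"},
        edges={
            Edge("monitor", "on", "desk"),
            Edge("keyboard", "on", "desk"),
            Edge("chair", "in_front_of", "desk"),
        },
    ),
    "bedside": SceneGraph(
        nodes={"bed", "nightstand", "lamp"},
        edges={Edge("nightstand", "next_to", "bed"), Edge("lamp", "on", "nightstand")},
    ),
    "reading_nook": SceneGraph(
        nodes={"chair", "lamp"},
        edges={Edge("lamp", "next_to", "chair")},
    ),
}

if __name__ == "__main__":
    query = parse_command("Add a monitor on the desk and a chair in front of the desk.")
    candidates = retrieve(query, DATABASE)
    print(candidates)
    print(implied_objects(query, DATABASE[candidates[0]]))
```

Returning several ranked candidates rather than a single best match loosely mirrors the suggestive interface described in the abstract, where multiple interpretations of a command are offered to the user. The final alignment of the augmented sub-scene with the current 3D scene is geometric and is not covered by this sketch.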


