“Generating Activity Snippets by Learning Human-Scene Interactions” by Li and Yu

  • ©Changyang Li and Lap-Fai Yu




    Generating Activity Snippets by Learning Human-Scene Interactions

Session/Category Title: Character Animation: Interaction




    We present an approach to generate virtual activity snippets, which comprise sequenced keyframes of multi-character, multi-object interaction scenarios in 3D environments, by learning from recordings of human-scene interactions. The generation consists of two stages. First, we use a sequential deep graph generative model with a temporal module to iteratively generate keyframe descriptions, which represent abstract interactions using graphs, while preserving spatial-temporal relations through the activities. Second, we devise an optimization framework to instantiate the activity snippets in virtual 3D environments guided by the generated keyframe descriptions. Our approach optimizes the poses of character and object instances encoded by the graph nodes to satisfy the relations and constraints encoded by the graph edges. The instantiation process includes a coarse 2D optimization followed by a fine 3D optimization to effectively explore the complex solution space for placing and posing the instances. Through experiments and a perceptual study, we applied our approach to generate plausible activity snippets under different settings.


    1. Giuseppe Abrami, Alexander Henlein, Attila Kett, and Alexander Mehler. 2020. Text2scenevr: Generating hypertexts with vannotator as a pre-processing step for text2scene systems. In Proceedings of the 31st ACM Conference on Hypertext and Social Media. 177–186.
    2. Shailen Agrawal and Michiel van de Panne. 2016. Task-based locomotion. ACM Transactions on Graphics 35, 4 (2016), 1–11.
    3. Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. 2022. TEACH: Temporal Action Composition for 3D Humans. arXiv preprint arXiv:2209.04066 (2022).
    4. Yunfei Bai, Kristin Siu, and C Karen Liu. 2012. Synthesis of concurrent object manipulation tasks. ACM Transactions on Graphics 31, 6 (2012), 1–9.
    5. Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition. 961–970.
    6. Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, and Jitendra Malik. 2020. Long-term human motion prediction with scene context. In European Conference on Computer Vision. Springer, 387–404.
    7. Subramanian Chidambaram, Hank Huang, Fengming He, Xun Qian, Ana M Villanueva, Thomas S Redick, Wolfgang Stuerzlinger, and Karthik Ramani. 2021. Processar: An augmented reality-based tool to create in-situ procedural 2d/3d ar instructions. In Designing Interactive Systems Conference 2021. 234–249.
    8. Matthew Fisher, Daniel Ritchie, Manolis Savva, Thomas Funkhouser, and Pat Hanrahan. 2012. Example-based synthesis of 3D object arrangements. ACM Transactions on Graphics (TOG) 31, 6 (2012), 1–11.
    9. Matthew Fisher, Manolis Savva, Yangyan Li, Pat Hanrahan, and Matthias Nießner. 2015. Activity-centric scene synthesis for functional 3D scene modeling. ACM Transactions on Graphics 34, 6 (2015), 1–13.
    10. Qiang Fu, Xiaowu Chen, Xiaotian Wang, Sijia Wen, Bin Zhou, and Hongbo Fu. 2017. Adaptive synthesis of indoor scenes via activity-associated object relation graphs. ACM Transactions on Graphics (TOG) 36, 6 (2017), 1–13.
    11. Ghost Town Games. 2016. Overcooked. https://store.steampowered.com/app/448510/Overcooked/.
    12. Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek. 2021. Synthesis of compositional animations from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1396–1406.
    13. Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In International conference on machine learning. PMLR, 1263–1272.
    14. Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. 2020. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia. 2021–2029.
    15. Xiaojie Guo and Liang Zhao. 2022. A systematic survey on deep generative models for graph generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
    16. Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun Saito, Jimei Yang, Yi Zhou, and Michael J Black. 2021a. Stochastic scene-aware motion prediction. In ICCV. 11374–11384.
    17. Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J Black. 2021b. Populating 3D Scenes by Learning Human-Scene Interaction. In CVPR. 14708–14718.
    18. W Keith Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. (1970).
    19. Ruizhen Hu, Zeyu Huang, Yuhan Tang, Oliver Van Kaick, Hao Zhang, and Hui Huang. 2020. Graph2plan: Learning floorplan generation from layout graphs. ACM Transactions on Graphics (TOG) 39, 4 (2020), 118–1.
    20. Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. 2017. The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155 (2017), 1–23.
    21. Ananya Ipsita, Levi Erickson, Yangzi Dong, Joey Huang, Alexa K Bushinski, Sraven Saradhi, Ana M Villanueva, Kylie A Peppler, Thomas S Redick, and Karthik Ramani. 2022. Towards Modeling of Virtual Reality Welding Simulators to Promote Accessible and Scalable Training. In CHI Conference on Human Factors in Computing Systems. 1–21.
    22. Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. 2020. Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10236–10247.
    23. Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, and Song-chun Zhu. 2020. Lemma: A multi-view dataset for learning multi-agent multi-task activities. In European Conference on Computer Vision. Springer, 767–786.
    24. Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2018. Junction tree variational autoencoder for molecular graph generation. In International conference on machine learning. PMLR, 2323–2332.
    25. Hanseob Kim, Ghazanfar Ali, and Jae-In Hwang. 2021. ASAP: Auto-generating Story-board And Previz with Virtual Humans. In 2021 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). IEEE, 316–320.
    26. Vladimir G Kim, Siddhartha Chaudhuri, Leonidas Guibas, and Thomas Funkhouser. 2014. Shape2pose: Human-centric shape analysis. ACM Transactions on Graphics 33, 4 (2014), 1–12.
    27. Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. 1983. Optimization by simulated annealing. Science 220, 4598 (1983), 671–680.
    28. Kang Hoon Lee, Myung Geol Choi, and Jehee Lee. 2006. Motion patches: building blocks for virtual environments annotated with motion data. In ACM SIGGRAPH 2006 Papers. 898–906.
    29. Seunghwan Lee, Phil Sik Chang, and Jehee Lee. 2022. Deep Compliant Control. In ACM SIGGRAPH 2022 Conference Proceedings. Article 23, 9 pages.
    30. Ang Li, Meghana Thotakuri, David A Ross, João Carreira, Alexander Vostrikov, and Andrew Zisserman. 2020. The ava-kinetics localized human actions video dataset. arXiv preprint arXiv:2005.00214 (2020).
    31. Changyang Li, Wanwan Li, Haikun Huang, and Lap-Fai Yu. 2022. Interactive augmented reality storytelling guided by scene semantics. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–15.
    32. Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. 2018. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324 (2018).
    33. Zelun Luo, Zane Durante, Linden Li, Wanze Xie, Ruochen Liu, Emily Jin, Zhuoyi Huang, Lun Yu Li, Jiajun Wu, Juan Carlos Niebles, et al. 2022. MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
    34. Zelun Luo, Wanze Xie, Siddharth Kapoor, Yiyun Liang, Michael Cooper, Juan Carlos Niebles, Ehsan Adeli, and Fei-Fei Li. 2021. MOMA: Multi-Object Multi-Actor Activity Parsing. Advances in Neural Information Processing Systems 34 (2021), 17939–17955.
    35. Rui Ma, Honghua Li, Changqing Zou, Zicheng Liao, Xin Tong, and Hao Zhang. 2016. Action-driven 3D indoor scene evolution. ACM Trans. Graph. 35, 6 (2016), 173–1.
    36. Paul Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala, and Vladlen Koltun. 2011. Interactive furniture layout using interior design guidelines. ACM transactions on graphics (TOG) 30, 4 (2011), 1–10.
    37. Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 6 (1953), 1087–1092.
    38. Nawmal. 2019. Nawmal. https://www.nawmal.com/
    39. Mathis Petrovich, Michael J Black, and Gül Varol. 2021. Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10985–10995.
    40. Sören Pirk, Vojtech Krs, Kaimo Hu, Suren Deepak Rajasekaran, Hao Kang, Yusuke Yoshiyasu, Bedrich Benes, and Leonidas J Guibas. 2017. Understanding and exploiting object interaction landscapes. ACM Transactions on Graphics (TOG) 36, 3 (2017), 1–14.
    41. Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. 2014. SceneGrok: Inferring action maps in 3D environments. ACM Transactions on Graphics 33, 6 (2014), 1–10.
    42. Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. 2016. Pigraphs: learning interaction snapshots from observations. ACM Transactions on Graphics 35, 4 (2016), 1–12.
    43. Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2616–2625.
    44. Martin Simonovsky and Nikos Komodakis. 2018. Graphvae: Towards generation of small graphs using variational autoencoders. In International conference on artificial neural networks. Springer, 412–422.
    45. Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Transactions on Graphics 38, 6 (2019), 209–1.
    46. Jiahui Sun, Wenming Wu, Ligang Liu, Wenjie Min, Gaofeng Zhang, and Liping Zheng. 2022. WallPlan: synthesizing floorplans by learning to generate wall graphs. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–14.
    47. Tomu Tahara, Takashi Seno, Gaku Narita, and Tomoya Ishikawa. 2020. Retargetable AR: Context-aware augmented reality in indoor scenes based on 3D scene graph. In 2020 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). IEEE, 249–255.
    48. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
    49. He Wang, Sören Pirk, Ersin Yumer, Vladimir G Kim, Ozan Sener, Srinath Sridhar, and Leonidas J Guibas. 2019b. Learning a Generative Model for Multi-Step Human-Object Interactions from Videos. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 367–378.
    50. Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, and Xiaolong Wang. 2021a. Synthesizing long-term 3d human motion and interaction in 3d scenes. In CVPR. 9401–9411.
    51. Jingbo Wang, Sijie Yan, Bo Dai, and Dahua Lin. 2021b. Scene-aware generative network for human motion synthesis. In CVPR. 12206–12215.
    52. Kai Wang, Yu-An Lin, Ben Weissmann, Manolis Savva, Angel X Chang, and Daniel Ritchie. 2019a. Planit: Planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–15.
    53. Kai Wang, Manolis Savva, Angel X Chang, and Daniel Ritchie. 2018. Deep convolutional priors for indoor scene synthesis. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–14.
    54. Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. 2022. Humanise: Language-conditioned human motion generation in 3d scenes. arXiv preprint arXiv:2210.09729 (2022).
    55. Sifan Ye, Yixing Wang, Jiaman Li, Dennis Park, C. Karen Liu, Huazhe Xu, and Jiajun Wu. 2022. Scene Synthesis from Human Motion. In SIGGRAPH Asia 2022 Conference Papers. Article 26, 9 pages.
    56. Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. 2018. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International conference on machine learning. PMLR, 5708–5717.
    57. Lap-Fai Yu, Sai Kit Yeung, Chi-Keung Tang, Demetri Terzopoulos, Tony F. Chan, and Stanley Osher. 2011. Make it home: automatic optimization of furniture arrangement. ACM Trans. Graph. 30, 4 (2011), 86.
    58. Jia-Qi Zhang, Xiang Xu, Zhi-Meng Shen, Ze-Huan Huang, Yang Zhao, Yan-Pei Cao, Pengfei Wan, and Miao Wang. 2021. Write-An-Animation: High-level Text-based Animation Editing with Character-Scene Interaction. In Computer Graphics Forum, Vol. 40. Wiley Online Library, 217–228.
    59. Siwei Zhang, Yan Zhang, Qianli Ma, Michael J Black, and Siyu Tang. 2020c. PLACE: Proximity learning of articulation and contact in 3D environments. In 2020 International Conference on 3D Vision (3DV). IEEE, 642–651.
    60. Yan Zhang, Mohamed Hassan, Heiko Neumann, Michael J Black, and Siyu Tang. 2020a. Generating 3d people in scenes without people. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6194–6204.
    61. Zaiwei Zhang, Zhenpei Yang, Chongyang Ma, Linjie Luo, Alexander Huth, Etienne Vouga, and Qixing Huang. 2020b. Deep generative modeling for scene synthesis via hybrid representations. ACM Transactions on Graphics (TOG) 39, 2 (2020), 1–21.

ACM Digital Library Publication:

Overview Page: