“Interactive augmented reality storytelling guided by scene semantics” by Li, Li, Huang and Yu

  • ©Changyang Li, Wanwan Li, Haikun Huang, and Lap-Fai Yu







    We present a novel interactive augmented reality (AR) storytelling approach guided by indoor scene semantics. Our approach automatically places virtual content in real-world environments to deliver AR stories that match both the story plot and the scene semantics. During storytelling, a player can participate as a character in the story, and the behaviors of the virtual characters and the placement of the virtual items adapt to the player’s actions. An input raw story is represented as a sequence of events containing high-level descriptions of the characters’ states; it is converted into a graph representation that is automatically supplemented with low-level spatial details. Our hierarchical story sampling approach uses optimization to sample realistic character behaviors that fit the story context, and an animator, which estimates and prioritizes the player’s actions, animates the virtual characters to tell the story in AR. Through experiments and a user study, we validated the effectiveness of our approach for AR storytelling in different environments.
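To make the story representation described in the abstract more concrete, the following is a minimal sketch of how a sequence of high-level events might be turned into a graph and supplemented with a spatial detail. It is not the authors' implementation: the `Event` fields, the `STATE_TO_OBJECT` lookup, and the choice of linking consecutive events of the same character are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup (not from the paper): the scene-object category
# a character state is assumed to need.
STATE_TO_OBJECT = {"sit": "chair", "read": "bookshelf", "sleep": "bed"}


@dataclass
class Event:
    """One high-level story event: a character enters a state."""
    character: str
    state: str
    target: Optional[str] = None  # low-level spatial detail, filled in later


def build_story_graph(events):
    """Return (nodes, edges) for an ordered list of events.

    Nodes are copies of the events with a target object category
    supplemented from STATE_TO_OBJECT when none was given. Each edge
    (i, j) links an event to the next event of the same character.
    """
    nodes, edges = [], []
    last_seen = {}  # character name -> index of that character's latest event
    for i, ev in enumerate(events):
        target = ev.target or STATE_TO_OBJECT.get(ev.state)
        nodes.append(Event(ev.character, ev.state, target))
        if ev.character in last_seen:
            edges.append((last_seen[ev.character], i))
        last_seen[ev.character] = i
    return nodes, edges


story = [Event("Alice", "read"), Event("Bob", "sit"), Event("Alice", "sit")]
nodes, edges = build_story_graph(story)
```

In this sketch the only supplemented detail is an object category per event; an actual system would presumably also need positions and orientations drawn from the scanned scene.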


