Animating Street View

We present a system that automatically brings street view imagery to life by populating it with naturally behaving, animated pedestrians and vehicles. Our approach is to remove existing people and vehicles from the input image, insert moving objects with proper scale, angle, motion and appearance, plan paths and traffic behavior, as well as render the scene with plausible occlusion and shadowing effects. The system achieves these by reconstructing the still image street scene, simulating crowd behavior, and rendering with consistent lighting, visibility, occlusions, and shadows. We demonstrate results on a diverse range of street scenes including regular still images and panoramas.

References:

[1]
Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, and Trevor Darrell. 2020. Compositional GAN: Learning Image-conditional Binary Composition. International Journal of Computer Vision (IJCV) 128 (2020), 2570–2585.

[2]
S. Farooq Bhat, I. Alhashim, and P. Wonka. 2021. AdaBins: Depth Estimation Using Adaptive Bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4008–4017.

[3]
Shih-Hsiu Chang, Ching-Ya Chiu, Chia-Sheng Chang, Kuo-Wei Chen, Chih-Yuan Yao, Ruen-Rone Lee, and Hung-Kuo Chu. 2018. Generating 360 Outdoor Panorama Dataset with Reliable Sun Position Estimation. In SIGGRAPH Asia Posters (Tokyo, Japan). Association for Computing Machinery, New York, NY, USA, Article 22, 2 pages.

[4]
Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. 2021. GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]
Jui-Ting Chien, Chia-Jung Chou, Ding-Jie Chen, and Hwann-Tzong Chen. 2017. Detecting Nonexistent Pedestrians. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW).

[6]
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]
Aram Davtyan and Paolo Favaro. 2022. Controllable Video Generation Through Global and Local Motion Dynamics. In Proceedings of the European Conference on Computer Vision (ECCV). 68–84.

[8]
Dave Epstein, Taesung Park, Richard Zhang, Eli Shechtman, and Alexei A. Efros. 2022. BlobGAN: Spatially Disentangled Scene Representations. European Conference on Computer Vision (ECCV) (2022).

[9]
Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.

[10]
Yannick Hold-Geoffroy, Akshaya Athawale, and Jean-François Lalonde. 2019. Deep Sky Modeling for Single Image Outdoor Lighting Estimation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6920–6928.

[11]
Aleksander Holynski, Brian L. Curless, Steven M. Seitz, and Richard Szeliski. 2021. Animating Pictures With Eulerian Motion Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5810–5819.

[12]
Yan Hong, Li Niu, and Jianfu Zhang. 2022. Shadow Generation for Composite Image in Real-world Scenes. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

[13]
Yaosi Hu, Chong Luo, and Zhenzhong Chen. 2022. Make It Move: Controllable Image-to-Video Generation With Text Descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18219–18228.

[14]
Yangyi Huang, Hongwei Yi, Weiyang Liu, Haofan Wang, Boxi Wu, Wenxiao Wang, Binbin Lin, Debing Zhang, and Deng Cai. 2022a. One-shot Implicit Animatable Avatars with Model-based Priors. arxiv:arXiv:2212.02469

[15]
Ziyuan Huang, Zhengping Zhou, Yung-Yu Chuang, Jiajun Wu, and C. Karen Liu. 2022b. Physically Plausible Animation of Human Upper Body from a Single Image. arxiv:2212.04741

[16]
Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange. 2018. Unity: A General Platform for Intelligent Agents. arxiv:1809.02627

[17]
Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. 2023. DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion. arxiv:arXiv:2304.06025

[18]
Kevin Karsch, Varsha Hedau, David Forsyth, and Derek Hoiem. 2011. Rendering Synthetic Objects into Legacy Photographs. ACM Trans. Graph. 30, 6 (dec 2011), 1–12. https://doi.org/10.1145/2070781.2024191

[19]
Kevin Karsch, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, Hailin Jin, Rafael Fonte, Michael Sittig, and David Forsyth. 2014. Automatic Scene Inference for 3D Object Compositing. ACM Trans. Graph. 33, 3, Article 32 (jun 2014), 15 pages. https://doi.org/10.1145/2602146

[20]
Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. 2021. DriveGAN: Towards a Controllable High-Quality Neural Simulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]
Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. 2018. Context-Aware Synthesis and Placement of Object Instances. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS). 10414–10424.

[22]
Donghoon Lee, Tomas Pfister, and Ming-Hsuan Yang. 2019. Inserting Videos Into Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]
Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. 2018. ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 9455–9464.

[24]
Daquan Liu, Chengjiang Long, Hongpan Zhang, Hanning Yu, Xinzhi Dong, and Chunxia Xiao. 2020. ARShadowGAN: Shadow Generative Adversarial Network for Augmented Reality in Single Light Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]
Timo Lüddecke and Alexander Ecker. 2022. Image Segmentation Using Text and Image Prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7086–7096.

[26]
Wei-Chiu Ma, Shenlong Wang, Marcus A. Brubaker, Sanja Fidler, and Raquel Urtasun. 2016. Find your way by observing the sun and other semantic cues. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA). 6292–6299.

[27]
Wan-Duo Kurt Ma, J. P. Lewis, W. Bastiaan Kleijn, and Thomas Leung. 2023. Directed Diffusion: Direct Control of Object Placement through Attention Guidance. arxiv:2302.13153

[28]
Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. 2022. Implicit Warping for Animation with Image Sets. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS).

[29]
Willi Menapace, Stéphane Lathuilière, Aliaksandr Siarohin, Christian Theobalt, Sergey Tulyakov, Vladislav Golyanik, and Elisa Ricci. 2022. Playable Environments: Video Manipulation in Space and Time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3584–3593.

[30]
Willi Menapace, Stephane Lathuiliere, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. 2021. Playable Video Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10061–10070.

[31]
Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. 2020. BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images. In Advances in Neural Information Processing Systems 33.

[32]
Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, and Martin Renqiang Min. 2023. Conditional Image-to-Video Generation with Latent Flow Diffusion Models. arxiv:2303.13744

[33]
Michael Niemeyer and Andreas Geiger. 2021. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).

[34]
Makoto Okabe, Ken Anjyor, and Rikio Onai. 2011. Creating Fluid Animation from a Single Image using Video Database. Computer Graphics Forum 30, 7 (2011), 1973–1982. https://doi.org/10.1111/j.1467-8659.2011.02062.x

[35]
Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. 2021. Animatable Neural Radiance Fields for Modeling Dynamic Human Bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 14314–14323.

[36]
Patrick Pérez, Michel Gangnet, and Andrew Blake. 2003. Poisson image editing. In ACM SIGGRAPH. Association for Computing Machinery, New York, NY, USA, 313–318.

[37]
A. Pumarola, A. Agudo, A.M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. 2019. GANimation: One-Shot Anatomically Consistent Facial Animation. International Journal of Computer Vision (IJCV) (2019).

[38]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695.

[39]
Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, and Daniel Aliaga. 2022. ObjectStitch: Generative Object Compositing. arxiv:2212.00932

[40]
Jin Sun, Hadar Averbuch-Elor, Qianqian Wang, and Noah Snavely. 2020. Hidden Footprints: Learning Contextual Walkability from 3D Human Trails. In Proceedings of the European Conference on Computer Vision (ECCV).

[41]
Jiajun Tang, Yongjie Zhu, Haoyu Wang, Jun Hoong Chan, Si Li, and Boxin Shi. 2022. Estimating Spatially-Varying Lighting In Urban Scenes With Disentangled Representation. In Proceedings of the European Conference on Computer Vision (ECCV). Springer Nature Switzerland, Cham, 454–469.

[42]
Adrien Treuille, Seth Cooper, and Zoran Popović. 2006. Continuum Crowds. ACM Transactions on Graphics 25, 3 (july 2006), 1160–1168. https://doi.org/10.1145/1141911.1142008

[43]
J.N. Tsitsiklis. 1995. Efficient Algorithms for Globally Optimal Trajectories. IEEE Trans. Automat. Control 40, 9 (1995), 1528–1538.

[44]
Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. 2019. Few-shot Video-to-Video Synthesis. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS).

[45]
Yifan Wang, Brian L Curless, and Steven M Seitz. 2020. People as Scene Probes. In Proceedings of the European Conference on Computer Vision (ECCV).

[46]
Yifan Wang, Andrew Liu, Richard Tucker, Jiajun Wu, Brian L Curless, Steven M Seitz, and Noah Snavely. 2021. Repopulating Street Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]
Zian Wang, Wenzheng Chen, David Acuna, Jan Kautz, and Sanja Fidler. 2022. Neural Light Field Estimation for Street Scenes with Differentiable Virtual Object Insertion. In Proceedings of the European Conference on Computer Vision (ECCV). Springer Nature Switzerland, Cham, 380–397.

[48]
Chung-Yi Weng, Brian Curless, and Ira Kemelmacher. 2019. Photo Wake-Up: 3D Character Animation From a Single Photo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5901–5910.

[49]
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. 2021. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of Neural Information Processing Systems (NeurIPS).

[50]
Yinghao Xu, Menglei Chai, Zifan Shi, Sida Peng, Skorokhodov Ivan, Siarohin Aliaksandr, Ceyuan Yang, Yujun Shen, Hsin-Ying Lee, Bolei Zhou, and Tulyakov Sergy. 2022. DiscoScene: Spatially Disentangled Generative Radiance Field for Controllable 3D-aware Scene Synthesis. arxiv:2212.11984

[51]
Yang Xue, Yuheng Li, Krishna Kumar Singh, and Yong Jae Lee. 2022. GIRAFFE HD: A High-Resolution 3D-aware Generative Model. In CVPR.

[52]
Jae Shin Yoon, Lingjie Liu, Vladislav Golyanik, Kripasindhu Sarkar, Hyun Soo Park, and Christian Theobalt. 2021. Pose-Guided Human Animation from a Single Image in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]
Wei Yu, Wenxin Chen, Songheng Yin, Steve Easterbrook, and Animesh Garg. 2022. Modular Action Concept Grounding in Semantic Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]
Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. 2019. Spatial Fusion GAN for Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]
Haotian Zhang, Cristobal Sciutto, Maneesh Agrawala, and Kayvon Fatahalian. 2021. Vid2player: Controllable video sprites that behave and appear like professional tennis players. ACM Transactions on Graphics (TOG) 40, 3 (2021), 1–16.

[56]
Haotian Zhang, Ye Yuan, Viktor Makoviychuk, Yunrong Guo, Sanja Fidler, Xue Bin Peng, and Kayvon Fatahalian. 2023. Learning Physically Simulated Tennis Skills from Broadcast Videos. ACM Transactions on Graphics (TOG) (2023), 1–16.

[57]
Lei Zhu, Ke Xu, Zhanghan Ke, and Rynson W.H. Lau. 2021. Mitigating Intensity Bias in Shadow Detection via Feature Decomposition and Reweighting. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV). 4682–4691.

ACM Digital Library Publication:

Animating Street View

Overview Page:

SIGGRAPH Asia 2023: Technical Papers

Submit a story:

If you would like to submit a story about this presentation, please contact us: historyarchives@siggraph.org

ACM SIGGRAPH HISTORY ARCHIVES

“Animating Street View” by Shan, Curless, Kemelmacher-Shlizerman and Seitz

Conference:

Type(s):

Title:

Session/Category Title:

Presenter(s)/Author(s):

Abstract:

References:

ACM Digital Library Publication:

Overview Page:

Submit a story:

Sponsored by: