Spice-E: Structural Priors in 3D Diffusion Using Cross-Entity Attention

Etai Sella; Gal Fiebelman; Noam Atia; Hadar Averbuch-Elor

Conference:

SIGGRAPH 2024

Type(s):

Technical Papers

Title:

Spice-E: Structural Priors in 3D Diffusion Using Cross-Entity Attention

Presenter(s)/Author(s):

Etai Sella

Gal Fiebelman

Noam Atia

Hadar Averbuch-Elor

Abstract:

Text-to-3D diffusion models can generate high-quality 3D shapes in seconds, but they are hard to control. In this work, we introduce Spice-E, a neural network that adds structural guidance to 3D diffusion models, enabling a variety of 3D-to-3D tasks with state-of-the-art performance.

ACM Digital Library Publication:

Spice-E: Structural Priors in 3D Diffusion Using Cross-Entity Attention

Overview Page:

SIGGRAPH 2024: Technical Papers
