“Transparent Image Layer Diffusion Using Latent Transparency”
Abstract:
We present an approach enabling large-scale pretrained latent diffusion models like Stable Diffusion to generate transparent images and layers.
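As background for what "transparent images and layers" means downstream: once a model emits an RGBA layer, it is combined with other content via standard alpha-over compositing. The sketch below is a minimal NumPy illustration of that compositing step only (straight, non-premultiplied alpha); it is not the paper's latent-transparency method, and the function name is ours.

```python
import numpy as np

def alpha_over(fg_rgba, bg_rgb):
    """Composite a transparent foreground layer over an opaque background.

    fg_rgba: (H, W, 4) float array in [0, 1]; last channel is alpha.
    bg_rgb:  (H, W, 3) float array in [0, 1].
    Returns the (H, W, 3) composite using straight-alpha blending.
    """
    rgb, a = fg_rgba[..., :3], fg_rgba[..., 3:4]
    # Per-pixel blend: alpha weights the foreground, (1 - alpha) the background.
    return a * rgb + (1.0 - a) * bg_rgb
```

A fully transparent layer (alpha = 0 everywhere) leaves the background unchanged, while a fully opaque layer replaces it; intermediate alpha values blend linearly, which is what makes independently generated layers stackable.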