“StyleGAN-NADA: CLIP-guided domain adaptation of image generators” by Rinon Gal, Or Patashnik, Haggai Maron, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or

  • Rinon Gal, Or Patashnik, Haggai Maron, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or

Title:

    StyleGAN-NADA: CLIP-guided domain adaptation of image generators

Presenter(s)/Author(s):

    Rinon Gal, Or Patashnik, Haggai Maron, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or

Abstract:


    Can a generative model be trained to produce images from a specific domain, guided only by a text prompt, without seeing any image? In other words: can an image generator be trained “blindly”? Leveraging the semantic power of large-scale Contrastive Language-Image Pre-training (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image. We show that through natural language prompts and a few minutes of training, our method can adapt a generator across a multitude of domains characterized by diverse styles and shapes. Notably, many of these modifications would be difficult or infeasible to reach with existing methods. We conduct an extensive set of experiments across a wide range of domains. These demonstrate the effectiveness of our approach, and show that our models preserve the latent-space structure that makes generative models appealing for downstream tasks. Code and videos available at: stylegan-nada.github.io/
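
    The abstract does not spell out the training objective, so the listing below is only an illustrative sketch (not the authors' released code) of how a CLIP model can steer a pre-trained generator toward a text-described domain: a frozen copy of the generator and a trainable copy render the same latents, and the CLIP-space shift between their outputs is pushed toward the direction between a source prompt and a target prompt. It assumes PyTorch and OpenAI's clip package; the generator handles (G_frozen, G_trainable) and all function names are hypothetical.

    import torch
    import torch.nn.functional as F
    import clip  # OpenAI's CLIP: https://github.com/openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)

    # CLIP's published input-normalization constants.
    CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
    CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

    def text_direction(source_text, target_text):
        """Normalized CLIP-space direction from a source-domain prompt to a target one."""
        tokens = clip.tokenize([source_text, target_text]).to(device)
        with torch.no_grad():
            src, tgt = clip_model.encode_text(tokens).float()
        d = tgt - src
        return d / d.norm()

    def embed_images(images):
        """Embed generator outputs (assumed to lie in [-1, 1]) with CLIP's image encoder."""
        images = F.interpolate(images, size=(224, 224), mode="bicubic", align_corners=False)
        images = ((images + 1.0) / 2.0 - CLIP_MEAN) / CLIP_STD
        return clip_model.encode_image(images).float()

    def directional_loss(frozen_images, trainable_images, t_dir):
        """Push the image-space shift (frozen -> trainable) to follow the text direction."""
        i_dir = embed_images(trainable_images) - embed_images(frozen_images)
        i_dir = i_dir / i_dir.norm(dim=-1, keepdim=True)
        return (1.0 - (i_dir * t_dir).sum(dim=-1)).mean()

    # Usage sketch: only G_trainable receives gradients, G_frozen stays fixed.
    #   t_dir = text_direction("photo", "sketch")
    #   loss = directional_loss(G_frozen(z), G_trainable(z), t_dir)
    #   loss.backward()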
