“ASSET: autoregressive semantic scene editing with transformers at high resolutions” by Liu, Shetty, Hinz, Fisher, Zhang, et al.

  • Difan Liu, Sandesh Shetty, Tobias Hinz, Matthew Fisher, Richard Zhang, Taesung Park, and Evangelos Kalogerakis







    We present ASSET, a neural architecture for automatically modifying an input high-resolution image according to a user’s edits on its semantic segmentation map. Our architecture is based on a transformer with a novel attention mechanism. Our key idea is to sparsify the transformer’s attention matrix at high resolutions, guided by dense attention extracted at lower image resolutions. Previous attention mechanisms are either computationally too expensive to handle high-resolution images or overly constrained within specific image regions, hampering long-range interactions; our novel attention mechanism is both computationally efficient and effective. Our sparsified attention mechanism is able to capture long-range interactions and context, enabling the synthesis of interesting phenomena in scenes, such as reflections of landscapes onto water or flora consistent with the rest of the landscape, that were not possible to generate reliably with previous convnet and transformer approaches. We present qualitative and quantitative results, along with user studies, demonstrating the effectiveness of our method. Our code and dataset are available at our project page: https://github.com/DifanLiu/ASSET
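    The core idea described above, that a cheap dense attention map computed at low resolution can decide which key locations each high-resolution query should attend to, can be illustrated with a small sketch. This is an illustrative reconstruction, not the paper's implementation: the function name, the 1D token layout, the `stride` parent-cell mapping, and the `top_k` selection rule are all our assumptions for exposition.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def guided_sparse_attention(q_hi, k_hi, v_hi, q_lo, k_lo, top_k=2, stride=2):
        """Sparse high-res attention guided by dense low-res attention.

        Each high-res query attends only to high-res keys whose low-res
        'parent' cells rank in the top-k of the dense low-res attention row.
        Illustrative sketch (1D tokens); not the paper's actual API.
        """
        n_hi, d = q_hi.shape
        # Dense attention at low resolution is cheap: n_lo << n_hi.
        attn_lo = softmax(q_lo @ k_lo.T / np.sqrt(d))          # (n_lo, n_lo)
        # For each low-res query cell, keep its top-k key cells.
        topk_cells = np.argsort(-attn_lo, axis=1)[:, :top_k]   # (n_lo, top_k)

        out = np.zeros_like(v_hi)
        for i in range(n_hi):
            parent = i // stride  # low-res cell containing high-res query i
            # High-res keys that belong to the selected low-res cells.
            keys = np.concatenate([np.arange(c * stride, (c + 1) * stride)
                                   for c in topk_cells[parent]])
            scores = softmax(q_hi[i] @ k_hi[keys].T / np.sqrt(d))
            out[i] = scores @ v_hi[keys]
        return out
    ```

    Note the cost trade-off the sketch makes explicit: each high-resolution query scores only `top_k * stride` keys instead of all `n_hi`, while the long-range selection (which regions matter at all) is delegated to the dense low-resolution map. When `top_k` equals the number of low-res cells, the result reduces to ordinary dense attention.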


