FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning
A fast text-guided single-image editing method, accelerating the editing process to only 17 seconds
Text-guided single-image editing has emerged as a promising solution that enables users to precisely alter an input image based on a target text, such as making a standing dog appear seated or a bird spread its wings. While effective, conventional approaches require a two-step process: fine-tuning the target text embedding for over 1K iterations and then the generative model for another 1.5K iterations. Although this ensures that the resulting image closely aligns with both the input image and the target text, the process often takes about 7 minutes per image, which limits its practical application. To address this bottleneck, we introduce FastEdit, a fast text-guided single-image editing method with semantic-aware diffusion fine-tuning that accelerates the editing process to only 17 seconds. FastEdit streamlines the generative model's fine-tuning phase, reducing it from 1.5K to a mere 50 iterations. For diffusion fine-tuning, we select time-step values according to the semantic discrepancy between the input image and the target text. Furthermore, FastEdit circumvents the initial text-embedding fine-tuning step by utilizing an image-to-image model that conditions on the image feature space rather than the text embedding space. This effectively aligns the target text prompt and the input image within the same feature space and saves substantial processing time. Additionally, we apply the parameter-efficient fine-tuning technique LoRA to the U-Net, reducing the model's trainable parameters to only 0.37% of the original size. Together, these changes achieve comparable editing outcomes with significantly reduced computational overhead.
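To make the semantic-aware time-step idea concrete, the sketch below uses the CLIP similarity between the input image and the target text to bound how noisy the sampled diffusion time steps are: small edits keep small time steps, while large semantic changes allow noisier ones. The mapping, the range [t_min, t_max], and the CLIP checkpoint are illustrative assumptions, not the paper's exact schedule.

# Illustrative sketch of semantic-aware time-step selection (assumed mapping, not
# FastEdit's exact schedule): a larger image-text discrepancy maps to larger
# (noisier) diffusion time steps, allowing the model to change more of the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def semantic_timestep_upper_bound(image: Image.Image, target_text: str,
                                  t_min: int = 200, t_max: int = 800) -> int:
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
    inputs = processor(text=[target_text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between CLIP image and text embeddings, in [-1, 1].
    sim = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds).item()
    discrepancy = (1.0 - sim) / 2.0          # map similarity to a discrepancy in [0, 1]
    return int(t_min + discrepancy * (t_max - t_min))

# During the 50 fine-tuning iterations, time steps would then be sampled from
# [t_min, t_upper] instead of the full training range.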
Our approach to single-image editing optimizes efficiency through semantic-aware diffusion fine-tuning, reducing training to just 50 iterations by selecting time-step values according to the semantic discrepancy between the input image and the target text. Additionally, we bypass the initial text-embedding optimization by employing an image-to-image variant of the Stable Diffusion model, which conditions on CLIP image features and thus aligns textual and visual features in the same space. Further, we incorporate Low-Rank Adaptation (LoRA), reducing the trainable parameters to only 0.37% of the original model, which effectively counters the language drift common to other fine-tuning techniques while maintaining high-quality outcomes.
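As a rough illustration of the LoRA step, the sketch below attaches low-rank adapters to the attention projections of a Stable Diffusion U-Net via diffusers' PEFT integration, so only a small fraction of parameters is trained. The checkpoint name and rank r=4 are assumptions for illustration; FastEdit fine-tunes an image-to-image variant, and its exact LoRA configuration may differ.

# Minimal sketch (not FastEdit's exact configuration): inject LoRA adapters into the
# attention projections of a Stable Diffusion U-Net and report the trainable fraction.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet = pipe.unet
unet.requires_grad_(False)  # freeze the original U-Net weights

lora_config = LoraConfig(
    r=4,            # low-rank dimension (assumed)
    lora_alpha=4,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # diffusers' PEFT integration adds trainable LoRA layers

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"Trainable parameters: {trainable}/{total} ({100 * trainable / total:.3f}%)")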
FastEdit allows editing a single image using different target texts, demonstrating its versatility across various image types such as animals, scenes, humans, and paintings.
Quantitative Comparison to Baseline Methods
Compared to baseline methods, FastEdit achieves comparable editing quality in just 17 seconds.
Qualitative Comparison to Baseline Methods
FastEdit applies rapid text-guided edits to a single real-world image while preserving the original image's details.
Conditional Feature Interpolation by Increasing 𝜂 Using a Consistent Seed
Various Editing Options with Random Seeds
Human Face Manipulation
@article{chen2024fastedit,
  title={FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning},
  author={Chen, Zhi and Zhao, Zecheng and Luo, Yadan and Huang, Zi},
  journal={arXiv preprint arXiv:2408.03355},
  year={2024}
}