ConsistEdit

Highly Consistent and Precise Training-free Visual Editing

Zixin Yin1, Ling-Hao Chen2,3, Lionel Ni1,4, Xili Dai4
1Hong Kong University of Science and Technology, 2Tsinghua University,
3International Digital Economy Academy, 4Hong Kong University of Science and Technology (Guangzhou)
SIGGRAPH Asia 2025

Abstract

Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing image and video generation models. However, current approaches struggle to deliver strong edits while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcrafted design, significantly enhancing reliability and consistency and enabling robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control. ConsistEdit represents a significant advancement in generative model editing and unlocks the full editing potential of MM-DiT architectures.

Overview

ConsistEdit Overview

Figure 1. (a) ConsistEdit enables multi-round editing by allowing users to specify both the target region and the nature of the edit through prompts. Unlike existing methods, it can perform structure-preserving edits (hair, clothing folds) as well as shape-changing yet identity-preserving edits within the edited regions, while keeping non-edited regions intact. (b) ConsistEdit handles multi-region edits in one pass and preserves both the edited structure and the unedited content. (c) Our method enables smooth control over the consistency strength in the edited region. In contrast, existing approaches lack smooth transitions and often alter non-edited areas. (d) Beyond image editing and rectified flow models, ConsistEdit generalizes well to all MM-DiT variants, including diffusion and video models.

Method

Three Key Insights for MM-DiT Attention Control

Through an in-depth analysis of MM-DiT's attention architecture, we derive three key insights that enable effective training-free attention control:

Insight 1: Vision-only is Crucial

Editing effectiveness relies on modifying only the vision parts, since interfering with text tokens often leads to generation instability.

Insight 2: Homogeneous for All Layers

Unlike U-Net, each layer in MM-DiT retains rich semantic content. Thus, attention control can be applied to all layers.

Insight 3: Strong Structure Control from Q and K

Applying attention control solely to the vision parts of Q and K yields strong and controllable structural preservation.
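
To make the three insights above concrete, the following is a minimal PyTorch-style sketch, not the released implementation: the target branch's vision Q/K are replaced by the cached source-branch projections while text tokens are left untouched, and the same control is applied at every MM-DiT attention layer. Tensor shapes and names are assumptions.

import torch
import torch.nn.functional as F

def controlled_joint_attention(q, k, v, q_src, k_src, n_txt, structure_ctrl=True):
    """q, k, v: [B, H, n_txt + n_vis, D] joint text+vision tokens of the target branch.
    q_src, k_src: the same projections cached from the source branch.
    n_txt: number of text tokens at the front of the joint sequence."""
    if structure_ctrl:
        # Insights 1 & 3: replace only the vision part of Q and K with the source
        # branch; text tokens stay untouched to keep generation stable.
        q = torch.cat([q[:, :, :n_txt], q_src[:, :, n_txt:]], dim=2)
        k = torch.cat([k[:, :, :n_txt], k_src[:, :, n_txt:]], dim=2)
    # Standard scaled dot-product attention on the (partially swapped) tokens.
    return F.scaled_dot_product_attention(q, k, v)

# Insight 2: every MM-DiT layer carries rich semantics, so the same control is
# applied at all attention layers rather than a handcrafted subset.
B, H, n_txt, n_vis, D = 1, 8, 77, 256, 64
rand = lambda: torch.randn(B, H, n_txt + n_vis, D)
out = controlled_joint_attention(rand(), rand(), rand(), rand(), rand(), n_txt)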

Q, K, V Vision Token Visualization

Figure 2. Visualization of projected Q, K, V vision tokens after PCA decomposition in attention layers of the MM-DiT blocks. Unlike U-Net, each layer in MM-DiT retains rich semantic content, supporting our insight that attention control must be applied to all layers.

Vision-only vs Original Comparison

Figure 4. Comparison of V token swapping strategies for content consistency. Swapping vision-only V tokens leads to superior content consistency under high consistency strength, while maintaining editing capability comparable to the original full-sequence swapping when the consistency strength is low.

ConsistEdit Pipeline

ConsistEdit Pipeline

Pipeline of ConsistEdit. Given a real image or video I_s and source text tokens P_s, we first invert the source to obtain the vision tokens z^T, which are concatenated with the target prompt tokens P_tg and passed into the generation process to produce the edited image or video I_tg. During inference, a mask M generated by our extraction method delineates editing and non-editing regions. We apply structure and content fusion to enable prompt-aligned edits while preserving structural consistency within edited regions and maintaining content integrity elsewhere.
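
The control flow of the pipeline can be summarized with the sketch below. Here invert, extract_mask, denoise_step, and decode are injected placeholders standing in for the inversion solver, the mask-extraction procedure, one attention-controlled MM-DiT denoising step, and the decoder; they are not real APIs.

import torch

def consistedit_pipeline(I_s, P_s, P_tg, *, invert, extract_mask, denoise_step,
                         decode, num_steps=50):
    # 1) Invert the source image/video to the initial vision tokens z^T and
    #    cache the source trajectory for attention control.
    z_T, src_cache = invert(I_s, P_s, num_steps)
    # 2) Extract the mask M delineating editing and non-editing regions.
    M = extract_mask(z_T, P_s, P_tg)
    # 3) Generate with the target prompt, applying structure and content fusion
    #    under M at every step (and, inside denoise_step, at every layer).
    z = z_T
    for t in reversed(range(num_steps)):
        z = denoise_step(z, P_tg, t, src_cache[t], M)
    return decode(z)

# Toy run with no-op stand-ins, just to show the control flow:
_ = consistedit_pipeline(
    torch.zeros(1), "a cat", "a red cat",
    invert=lambda I, P, n: (torch.zeros(1), [None] * n),
    extract_mask=lambda z, Ps, Pt: torch.ones(1),
    denoise_step=lambda z, P, t, cache, M: z,
    decode=lambda z: z,
    num_steps=4,
)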

Results

Quantitative Results on PIE-Bench

ConsistEdit achieves state-of-the-art performance on structure-consistent editing tasks:

Method             | Canny SSIM ↑ | BG PSNR ↑ | BG SSIM ↑ | CLIP Whole ↑ | CLIP Edited ↑
SDEdit             | 0.6795       | 23.99     | 0.8697    | 26.59        | 22.80
UniEdit-Flow       | 0.8029       | 30.56     | 0.9554    | 26.55        | 22.59
DiTCtrl            | 0.8235       | 29.54     | 0.9632    | 26.63        | 22.97
ConsistEdit (Ours) | 0.8811       | 36.76     | 0.9869    | 27.19        | 23.73
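
For context, the sketch below shows how metrics of this kind are commonly computed: SSIM between Canny edge maps for structure consistency and PSNR restricted to the non-edited background; the CLIP columns are cosine similarities between the target prompt and the whole or edited image. This illustrates the metric definitions and is not the exact PIE-Bench evaluation code.

import cv2
import numpy as np
from skimage.metrics import structural_similarity

def canny_ssim(src_rgb, edit_rgb):
    # Structure consistency: SSIM between Canny edge maps of source and edit.
    e_src = cv2.Canny(cv2.cvtColor(src_rgb, cv2.COLOR_RGB2GRAY), 100, 200)
    e_edit = cv2.Canny(cv2.cvtColor(edit_rgb, cv2.COLOR_RGB2GRAY), 100, 200)
    return structural_similarity(e_src, e_edit, data_range=255)

def background_psnr(src_rgb, edit_rgb, edit_mask):
    # Background preservation: PSNR over pixels outside the edit mask only.
    bg = ~edit_mask.astype(bool)                       # (H, W) background mask
    diff = src_rgb.astype(np.float64)[bg] - edit_rgb.astype(np.float64)[bg]
    mse = max(np.mean(diff ** 2), 1e-10)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Toy usage on synthetic images with a square edited region.
src = np.zeros((64, 64, 3), dtype=np.uint8)
edit = src.copy(); edit[20:40, 20:40] = 255
mask = np.zeros((64, 64), dtype=np.uint8); mask[20:40, 20:40] = 1
print(canny_ssim(src, edit), background_psnr(src, edit, mask))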

Visual Comparisons

ConsistEdit demonstrates superior editing quality and consistency across various scenarios:

Structure-consistent editing comparison

Figure 7. Qualitative comparison of methods on structure-consistent editing tasks. Our method achieves superior structure preservation and editing quality.

Structure-inconsistent editing comparison

Figure 8. Qualitative comparison of methods on structure-inconsistent editing tasks showing adaptability across varied scenarios.

Video Editing Results

ConsistEdit maintains temporal consistency in video editing while achieving high-quality edits:

Video Examples

Two example videos, each shown as a source video alongside its edited result.

Method Comparisons

Two side-by-side video comparisons on the same source, showing results from ConsistEdit (Ours), UniEdit-Flow, DiTCtrl, FireFlow, RF-Solver, and SDEdit.

Key Capabilities

ConsistEdit demonstrates three key capabilities that enable robust and flexible editing:

Multi-Round Editing

Sequential edits with maintained consistency across iterations

Multi-Region Editing

Simultaneous editing of multiple regions in a single pass

Fine-Grained Control

Adjustable consistency strength for precise editing control

Multi-round Editing

Real image multi-round editing

Figure 5. Real image multi-round editing results. Starting from a real image, we first perform inversion to project it into the latent space. We then sequentially edit the clothing color, motion, and hair.
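
A hypothetical sketch of this multi-round protocol is given below; edit_fn stands in for one full ConsistEdit pass (inversion plus guided generation), and the prompts are illustrative.

def multi_round_edit(image, rounds, edit_fn):
    """rounds: list of (source_prompt, target_prompt) pairs applied in order;
    each round's output becomes the next round's source, so visual errors
    must not accumulate across iterations."""
    for src_prompt, tgt_prompt in rounds:
        image = edit_fn(image, src_prompt, tgt_prompt)
    return image

rounds = [
    ("a person in a white shirt", "a person in a red shirt"),  # clothing color
    ("a person standing",         "a person waving"),          # motion
    ("a person with short hair",  "a person with long hair"),  # hair
]
# multi_round_edit(source_image, rounds, edit_fn=consistedit_pass)  # placeholder call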

Multi-region Editing

Multi-region editing

Figure 6. Multi-region editing results showing ConsistEdit's ability to edit multiple regions simultaneously while preserving consistency.
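
One way multi-region edits can be fused in a single pass is sketched below; this is a minimal illustration, not the paper's exact fusion operator. Each region's edited latent is pasted under its own mask, and the source latent is kept everywhere else.

import torch

def fuse_regions(z_src, z_edits, masks):
    """z_src: source-branch latent; z_edits[i]: latent edited for region i;
    masks[i]: {0, 1} mask of region i, broadcastable to the latent shape."""
    out = z_src.clone()
    for z_e, m in zip(z_edits, masks):
        out = m * z_e + (1 - m) * out   # paste each region's edit, keep the rest intact
    return out

# Toy example with two non-overlapping regions (top and bottom halves).
z = torch.randn(1, 16, 64, 64)
m1, m2 = torch.zeros(1, 1, 64, 64), torch.zeros(1, 1, 64, 64)
m1[..., :32, :] = 1.0
m2[..., 32:, :] = 1.0
fused = fuse_regions(z, [torch.randn_like(z), torch.randn_like(z)], [m1, m2])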

Fine-Grained Control

Fine-grained consistency control

Figure 9. Effect of consistency strength on structural consistency. High strength strictly enforces structural preservation, while low strength permits prompt-driven shape changes. Texture editing remains consistent, highlighting effective disentanglement.
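
Putting the earlier pieces together, the disentanglement can be illustrated with two independent knobs: a structure strength applied to the vision Q/K and a content strength applied to the vision V. The two-knob framing, the linear interpolation, and the parameter names are assumptions made for illustration, not the paper's exact operators.

import torch

def blend_vision(t_tgt, t_src, n_txt, strength):
    # Interpolate only the vision tokens of a projection toward the source branch.
    out = t_tgt.clone()
    out[:, :, n_txt:] = (1 - strength) * t_tgt[:, :, n_txt:] + strength * t_src[:, :, n_txt:]
    return out

def disentangled_control(q, k, v, q_src, k_src, v_src, n_txt,
                         structure_strength=1.0, content_strength=1.0):
    # structure_strength -> 1: enforce the source structure; -> 0: allow shape changes.
    # content_strength   -> 1: keep the source texture;      -> 0: let the prompt re-texture.
    q = blend_vision(q, q_src, n_txt, structure_strength)
    k = blend_vision(k, k_src, n_txt, structure_strength)
    v = blend_vision(v, v_src, n_txt, content_strength)
    return q, k, v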

FLUX Generalization

FLUX editing results

Figure 13. Examples of editing results with FLUX, demonstrating generalization to different MM-DiT variants.

Citation

@inproceedings{yin2025consistedit,
  title={ConsistEdit: Highly Consistent and Precise Training-free Visual Editing},
  author={Yin, Zixin and Chen, Ling-Hao and Ni, Lionel and Dai, Xili},
  booktitle={SIGGRAPH Asia 2025 Conference Papers},
  year={2025}
}