Method
Three Key Insights for MM-DiT Attention Control
Through an in-depth analysis of MM-DiT's attention architecture, we derive three key insights that enable effective training-free attention control:
Insight 1: Vision-only is Crucial
Editing effectiveness relies on modifying only the vision parts, since interfering with text tokens often leads to generation instability.
Insight 2: Homogeneous for All Layers
Unlike U-Net, each layer in MM-DiT retains rich semantic content. Thus, attention control can be applied to all layers.
Insight 3: Strong Structure Control from Q and K
Applying attention control solely to the vision parts of Q and K yields strong, controllable structural preservation.
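These three insights translate directly into a per-layer attention hook. Below is a minimal PyTorch sketch, assuming the source Q and K have been cached during inversion and that text tokens precede vision tokens in the joint sequence; the tensor layout, argument names, and helper structure are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def controlled_joint_attention(q_tgt, k_tgt, v_tgt, q_src, k_src, n_text, num_heads):
    # q/k/v: (batch, seq, dim) with text tokens first and vision tokens after.
    # Insight 1: leave the text-token rows untouched.
    # Insight 3: replace only the vision parts of Q and K with their source
    # counterparts to transfer structure; V stays the target's for content.
    q = q_tgt.clone()
    k = k_tgt.clone()
    q[:, n_text:] = q_src[:, n_text:]
    k[:, n_text:] = k_src[:, n_text:]

    b, s, d = q.shape
    head_dim = d // num_heads
    def split(x):  # (b, s, d) -> (b, heads, s, head_dim)
        return x.view(b, s, num_heads, head_dim).transpose(1, 2)

    out = F.scaled_dot_product_attention(split(q), split(k), split(v_tgt))
    return out.transpose(1, 2).reshape(b, s, d)

# Insight 2: the same hook is applied at every MM-DiT attention layer,
# not only at a subset of layers as is common with U-Net-based control.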
Figure 2. Visualization of projected Q, K, V vision tokens after PCA decomposition in attention layers of the MM-DiT blocks. Unlike U-Net, each layer in MM-DiT retains rich semantic content, supporting our insight that attention control must be applied to all layers.
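A visualization of this kind can be produced by projecting the per-layer vision tokens onto their top three principal components and rendering them as RGB. The sketch below assumes tokens of shape (num_tokens, dim) arranged on a height-by-width latent grid; the paper's exact preprocessing may differ.

import torch

def pca_rgb(tokens: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # tokens: (N, D) vision tokens from one attention layer (Q, K, or V), N = height * width.
    x = tokens - tokens.mean(dim=0, keepdim=True)
    # Columns of v are the principal directions; keep the first three.
    _, _, v = torch.pca_lowrank(x, q=3)
    proj = x @ v[:, :3]                                   # (N, 3)
    lo, hi = proj.min(dim=0).values, proj.max(dim=0).values
    proj = (proj - lo) / (hi - lo + 1e-8)                 # normalize to [0, 1]
    return proj.reshape(height, width, 3)                 # display as an RGB image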
Figure 4. Comparison of V token swapping strategies for content consistency. Swapping vision-only V tokens leads to superior content consistency under high consistency strength settings, while maintaining editing capability comparable to the original methods when the consistency strength is low.
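The content side can be handled analogously to the Q/K control above. A minimal sketch of vision-only V fusion is given below; the linear blend weighted by a consistency strength in [0, 1] is an illustrative assumption rather than the paper's exact rule.

import torch

def fuse_vision_v(v_tgt: torch.Tensor, v_src: torch.Tensor, n_text: int, strength: float) -> torch.Tensor:
    # v_*: (batch, seq, dim); text tokens first, vision tokens after.
    # Text-token values are never modified; only the vision part is blended.
    v = v_tgt.clone()
    v[:, n_text:] = strength * v_src[:, n_text:] + (1.0 - strength) * v_tgt[:, n_text:]
    return v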
ConsistEdit Pipeline
Pipeline of ConsistEdit. Given a real image or video I_s and source text tokens P_s, we first invert the source to obtain the vision tokens z^T, which are concatenated with the target prompt tokens P_tg and passed into the generation process to produce the edited image or video I_tg. During inference, a mask M generated by our extraction method delineates editing and non-editing regions. We apply structure and content fusion to enable prompt-aligned edits while preserving structural consistency within edited regions and maintaining content integrity elsewhere.
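The control flow described above can be summarized in a short structural sketch. The helpers are passed in as callables because their exact signatures are assumptions; only the order of operations (inversion, per-step mask extraction, attention-controlled denoising) mirrors the pipeline description.

from typing import Any, Callable
import torch

def consist_edit_pipeline(
    invert_fn: Callable[..., tuple],          # (I_s, P_s) -> (z^T, per-layer Q/K/V cache)
    mask_fn: Callable[..., torch.Tensor],     # (cache, step) -> editing mask M
    denoise_fn: Callable[..., torch.Tensor],  # one attention-controlled denoising step
    decode_fn: Callable[..., Any],            # latent -> image or video
    source: Any, prompt_src: Any, prompt_tgt: Any,
    num_steps: int, strength: float,
) -> Any:
    # Invert I_s with P_s to obtain the vision tokens z^T and cache source Q/K/V.
    z, src_cache = invert_fn(source, prompt_src, num_steps)
    for t in reversed(range(num_steps)):
        mask = mask_fn(src_cache, t)  # delineates editing / non-editing regions
        # Inside denoise_fn: vision-only Q/K swap (structure) and V fusion
        # (content, weighted by `strength`), applied at every MM-DiT layer
        # and blended through the mask so non-edited regions stay intact.
        z = denoise_fn(z, prompt_tgt, t, src_cache, mask, strength)
    return decode_fn(z)  # edited image or video I_tg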
Results
Quantitative Results on PIE-Bench
ConsistEdit achieves state-of-the-art performance on structure-consistent editing tasks.
Visual Comparisons
ConsistEdit demonstrates superior editing quality and consistency across various scenarios:
Figure 7. Qualitative comparison of methods on structure-consistent editing tasks. Our method achieves superior structure preservation and editing quality.
Figure 8. Qualitative comparison of methods on structure-inconsistent editing tasks showing adaptability across varied scenarios.
Video Editing Results
ConsistEdit maintains temporal consistency in video editing while achieving high-quality edits:
Video Examples
Two examples, each pairing the source video with its edited result.
Method Comparisons
Two video comparisons showing the source alongside ConsistEdit (Ours), UniEdit-Flow, DiTCtrl, FireFlow, RF-Solver, and SDEdit.
Key Capabilities
ConsistEdit demonstrates three key capabilities that enable robust and flexible editing:
Multi-Round Editing
Sequential edits that preserve consistency across iterations
Multi-Region Editing
Simultaneous editing of multiple regions in a single pass
Fine-Grained Control
Adjustable consistency strength for precise editing control
Multi-Round Editing
Figure 5. Real image multi-round editing results. Starting from a real image, we first perform inversion to project it into the latent space. We then sequentially edit the clothing color, motion, and hair.
Multi-Region Editing
Figure 6. Multi-region editing results showing ConsistEdit's ability to edit multiple regions simultaneously while preserving consistency.
Fine-Grained Control
Figure 9. Effect of consistency strength on structural consistency. High strength strictly enforces structural preservation, while low strength permits prompt-driven shape changes. Texture editing remains consistent, highlighting effective disentanglement.
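In code, such an ablation amounts to sweeping the strength argument of the hypothetical consist_edit_pipeline sketched earlier, with all other inputs held fixed.

# Hypothetical sweep over consistency strength; higher values enforce
# structural preservation, lower values permit prompt-driven shape changes.
results = [
    consist_edit_pipeline(invert_fn, mask_fn, denoise_fn, decode_fn,
                          source, prompt_src, prompt_tgt,
                          num_steps=50, strength=s)
    for s in (1.0, 0.75, 0.5, 0.25, 0.0)
]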
FLUX Generalization
Figure 13. Examples of editing results with FLUX, demonstrating generalization to different MM-DiT variants.
Citation
@inproceedings{yin2025consistedit,
title={ConsistEdit: Highly Consistent and Precise Training-free Visual Editing},
author={Yin, Zixin and Chen, Ling-Hao and Ni, Lionel and Dai, Xili},
booktitle={SIGGRAPH Asia 2025 Conference Papers},
year={2025}
}