LazyDrag

Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Zixin Yin1,2, Xili Dai3, Duomin Wang2, Xianfang Zeng2, Lionel M. Ni1,3, Gang Yu2, Heung-Yeung Shum1
1The Hong Kong University of Science and Technology, 2StepFun, 3The Hong Kong University of Science and Technology (Guangzhou)
arXiv preprint

Abstract

The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, forcing a fundamental compromise: weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits generative capability, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. Concretely, our method generates an explicit correspondence map from user drag inputs and uses it as a reliable reference to strengthen attention control. This reliable reference opens the door to a stable, full-strength inversion process, a first for the drag-based editing task. It obviates the need for TTO and unlocks the generative capability of the model. LazyDrag therefore naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening a dog's mouth and inpainting its interior, generating new objects such as a "tennis ball", or, for ambiguous drags, making context-aware changes such as moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and user studies. LazyDrag not only sets a new state of the art, but also paves the way for a new editing paradigm.

Overview

LazyDrag Overview

(a) Top: Comparison between our method and two baselines. The leftmost image shows the input image with multiple drag instructions, each indicated by a different color. The text below each result indicates the additional prompt used for generation; "N/A" means no additional prompt. TTO denotes test-time optimization, which requires per-image fine-tuning and multi-step latent optimization per drag instruction. Notably, our method successfully opens the dog's mouth and inpaints its interior. Furthermore, with prompt guidance, we can generate diverse results even under ambiguous drag inputs, without fine-tuning. (b) Bottom: Multi-round editing results using our approach. Our method supports not only sequential drag operations but also simultaneous actions such as movement and scaling, maintaining visual coherence throughout.

Method

LazyDrag Pipeline

Pipeline of LazyDrag. (a) An input image is inverted to a latent code z_T. Our correspondence map generation then yields an updated latent, a point-matching map, and blending weights α. Tokens cached during inversion are used to guide the sampling process for identity and background preservation. (b) In attention input control, a dual strategy is employed. For background regions (gray), the Q, K, and V tokens are replaced with their cached originals. For destination (red and blue) and transition (yellow) regions, the K and V tokens are concatenated with re-encoded (K only) source tokens retrieved via the map. (c) Attention output refinement blends the attention output values.
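
The page does not include code for this attention control, so the following is only a minimal sketch of the idea under simplifying assumptions (it is not the authors' implementation). Assumed inputs: q, k, v are the current denoising step's per-token attention inputs of shape (N, d); q_c, k_c, v_c are the tokens cached at the matching inversion step; bg and dst are boolean token masks for background and destination/transition regions; src_idx maps each token to its matched source token from the correspondence map; alpha holds per-token blending weights. The "(K only)" re-encoding of source tokens mentioned in the caption is omitted here for brevity.

import torch
import torch.nn.functional as F

def controlled_attention(q, k, v, q_c, k_c, v_c, bg, dst, src_idx, alpha):
    # (b) Background tokens: replace Q, K, V with the cached originals so the
    # unedited content is reproduced faithfully.
    q = torch.where(bg[:, None], q_c, q)
    k = torch.where(bg[:, None], k_c, k)
    v = torch.where(bg[:, None], v_c, v)

    # Destination / transition tokens: append the matched source tokens
    # (retrieved via the correspondence map) as extra keys and values, so the
    # dragged content can attend to its original appearance.
    k_ext = torch.cat([k, k_c[src_idx]], dim=0)
    v_ext = torch.cat([v, v_c[src_idx]], dim=0)
    out = F.scaled_dot_product_attention(
        q.unsqueeze(0), k_ext.unsqueeze(0), v_ext.unsqueeze(0)
    ).squeeze(0)

    # (c) Attention output refinement: blend the edited output with the cached
    # source output at the matched positions, weighted by alpha, applied only
    # to destination / transition tokens.
    out_c = F.scaled_dot_product_attention(
        q_c.unsqueeze(0), k_c.unsqueeze(0), v_c.unsqueeze(0)
    ).squeeze(0)
    blended = alpha[:, None] * out_c[src_idx] + (1.0 - alpha[:, None]) * out
    return torch.where(dst[:, None], blended, out)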

LazyDrag introduces a novel approach to drag-based editing by replacing unreliable implicit attention-based point matching with explicit correspondence maps. This fundamental shift enables stable full-strength inversion without test-time optimization.
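
As an illustration only (the paper's actual procedure may differ), one simple way to turn sparse drag instructions into a dense, explicit correspondence is to let every pixel inherit the displacement of its nearest drag vector, so each edited location points back to the source pixel it should draw content from:

import numpy as np

def build_correspondence_map(handles, targets, h, w):
    """handles, targets: (P, 2) arrays of (y, x) drag start / end points."""
    handles = np.asarray(handles, dtype=np.float32)
    targets = np.asarray(targets, dtype=np.float32)
    disp = handles - targets                      # per-drag offset back to the source

    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(np.float32)

    # The nearest drag target decides which displacement each pixel follows.
    d2 = ((grid[:, None, :] - targets[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)

    src = grid + disp[nearest]                    # where each pixel copies from
    src[:, 0] = src[:, 0].clip(0, h - 1)
    src[:, 1] = src[:, 1].clip(0, w - 1)
    return src.round().astype(np.int64).reshape(h, w, 2)

# Example: a single drag moving content from (40, 40) to (60, 80)
corr = build_correspondence_map([(40, 40)], [(60, 80)], h=128, w=128)
print(corr[60, 80])   # -> [40 40]: the destination points back to its source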

1. Explicit Correspondence Generation
   Generate reliable correspondence maps from user drag inputs instead of relying on implicit attention mechanisms.

2. Full-Strength Inversion
   Enable a stable inversion process at full strength, eliminating the weakened, compromised inversion that previous methods rely on (see the caching sketch after this list).

3. Unified Control
   Seamlessly combine precise geometric control with text guidance for complex semantic editing operations.

4. Multi-Modal Integration
   The first drag-based editing method built on Multi-Modal Diffusion Transformers, supporting advanced editing capabilities.
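
The page gives no implementation details for the inversion itself, so the sketch below only illustrates the caching idea under stated assumptions. The names model, invert_step, and denoise_step are hypothetical stand-ins, not an actual API: each timestep's attention tokens are stored during a full-strength inversion and handed back to the matching sampling step, where the attention control sketched earlier uses them for identity and background preservation.

def invert_and_cache(model, z0, timesteps):
    cache = {}                          # cache[t] -> per-layer (K, V) tokens
    z = z0
    for t in timesteps:                 # full-strength inversion, no truncation
        z, kv_per_layer = model.invert_step(z, t, return_kv=True)
        cache[t] = kv_per_layer
    return z, cache                     # z is the fully inverted latent z_T

def sample_with_guidance(model, z_T, timesteps, cache, edit_controls):
    z = z_T
    for t in reversed(timesteps):
        # Cached tokens from the matching inversion step guide attention so
        # background and identity are preserved while the drag is applied.
        z = model.denoise_step(z, t, cached_kv=cache[t], controls=edit_controls)
    return z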

Results

Quantitative Results on DragBench

Our method achieves state-of-the-art performance across all metrics without requiring test-time optimization (TTO):

Method             TTO-Req   MD ↓    SC ↑    PQ ↑    O ↑
DragNoise          ✓         37.87   7.793   8.058   7.704
DragDiffusion      ✓         34.84   7.905   8.325   7.798
FreeDrag           ✓         34.09   7.928   8.281   7.816
DiffEditor         ✓         26.95   7.603   8.266   7.715
GoodDrag           ✓         22.17   7.834   8.318   7.795
DragText           ✓         21.51   7.992   8.227   7.886
FastDrag           ✗         31.84   7.935   8.278   7.904
Inpaint4Drag       ✗         23.68   7.802   7.961   7.615
LazyDrag (Ours)    ✗         21.49   8.205   8.395   8.210

MD: Mean Distance (lower is better); SC: Semantic Consistency, PQ: Perceptual Quality, and O: Overall are VIEScore metrics (higher is better).

Qualitative Results

Visual comparison with baseline methods on DragBench:

Qualitative comparison results

Qualitative results compared with baselines on DragBench. Best viewed with zoom-in.

Examples of DragBench Cases with Various Additional Text Prompts

LazyDrag can resolve ambiguous drag instructions through text guidance, enabling semantically meaningful edits:

Text prompt examples

Examples showing how text prompts can guide ambiguous drag instructions to achieve different semantic meanings.

User Study Results

Expert evaluation by 20 participants on 32 cases from DragBench:

Method            Preference (%)
DragNoise         5.00
DragDiffusion     8.75
FreeDrag          8.75
Other methods     ≤ 3.75
LazyDrag (Ours)   61.88

Comparison Between Drag and Move Mode

LazyDrag supports both drag and move modes, each with distinct characteristics:

Drag vs Move mode comparison

Comparison between drag and move mode on DragBench.

Drag Mode

  • Enables natural geometric transformations
  • Supports 3D rotations and extensions
  • Slight degradation in detail texture preservation
  • Better for complex shape modifications

Move Mode

  • Better preserves object identity
  • Maintains fine details and textures
  • Less suitable for rotation or extension
  • Ideal for simple repositioning tasks

Citation

@article{yin2025lazydrag,
  title={LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence},
  author={Yin, Zixin and Dai, Xili and Wang, Duomin and Zeng, Xianfang and Ni, Lionel M. and Yu, Gang and Shum, Heung-Yeung},
  journal={arXiv preprint arXiv:2509.12203},
  year={2025}
}