Abstract
LazyDrag introduces a novel approach to drag-based editing: it replaces unreliable, implicit attention-based point matching with explicit correspondence maps. This fundamental shift enables stable, full-strength inversion without test-time optimization.
Method
Pipeline of LazyDrag. (a) An input image is inverted to a latent code z_T. Correspondence map generation then yields an updated latent, a point-matching map, and weights α. Tokens cached during inversion guide the sampling process for identity and background preservation. (b) Attention input control employs a dual strategy: for background regions (gray), the Q, K, and V tokens are replaced with their cached originals; for destination (red and blue) and transition (yellow) regions, the K and V tokens are concatenated with source tokens retrieved via the map (re-encoded for K only). (c) Attention output refinement blends the values of the attention output.
Key contributions:
- Generates reliable correspondence maps from user drag inputs instead of relying on implicit attention mechanisms (see the sketch below).
- Enables a stable, full-strength inversion process, eliminating the compromised weak inversion used by previous methods.
- Seamlessly combines precise geometric control with text guidance for complex semantic editing operations.
- Serves as the first drag-based editing method for Multi-Modal Diffusion Transformers, supporting advanced editing capabilities.
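The caption above outlines the mechanism in prose; the following is a minimal PyTorch-style sketch of how an explicit correspondence map and the dual attention control could fit together. Every name here (build_correspondence_map, controlled_attention, bg_mask, edit_mask, alpha) and the nearest-token map construction are illustrative assumptions, not LazyDrag's actual code or API.

```python
# Minimal sketch of the pipeline's attention control, assuming a flattened
# grid of latent tokens. All names are illustrative stand-ins.
import torch
import torch.nn.functional as F

def build_correspondence_map(src_pts, dst_pts, h, w):
    """Return, for every token, the index of its source token.

    Illustrative stand-in for the explicit correspondence map: each token
    defaults to itself; tokens at the drag destinations are redirected to
    the tokens at the corresponding drag origins.
    """
    idx = torch.arange(h * w)
    for (sx, sy), (dx, dy) in zip(src_pts, dst_pts):
        idx[dy * w + dx] = sy * w + sx
    return idx  # shape (h*w,)

def controlled_attention(q, k, v, cached_q, cached_k, cached_v,
                         bg_mask, edit_mask, corr, alpha):
    """One attention step with input control (b) and output refinement (c).

    q/k/v and cached_*: (tokens, dim) tensors; bg_mask/edit_mask: (tokens,)
    bool masks for background vs. destination/transition regions;
    corr: (tokens,) long map from build_correspondence_map;
    alpha: (tokens, 1) blending weights.
    """
    # (b) Background regions: replace Q, K, V with their cached originals.
    q = torch.where(bg_mask[:, None], cached_q, q)
    k = torch.where(bg_mask[:, None], cached_k, k)
    v = torch.where(bg_mask[:, None], cached_v, v)

    # (b) Destination/transition regions: append source tokens retrieved via
    # the correspondence map as extra keys/values. (The paper re-encodes the
    # retrieved K tokens; the cached ones stand in for that step here.)
    extra_k = cached_k[corr[edit_mask]]
    extra_v = cached_v[corr[edit_mask]]
    k_all = torch.cat([k, extra_k], dim=0)
    v_all = torch.cat([v, extra_v], dim=0)

    out = F.scaled_dot_product_attention(q[None], k_all[None], v_all[None])[0]

    # (c) Output refinement: blend the edited attention output with the
    # cached (original) output. Using the cached output as the blending
    # target is an assumption of this sketch.
    cached_out = F.scaled_dot_product_attention(
        cached_q[None], cached_k[None], cached_v[None])[0]
    return alpha * out + (1.0 - alpha) * cached_out
```

A real correspondence map would cover whole regions rather than single points; the sketch keeps only the token-level replace/concatenate/blend structure described in the pipeline figure.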
Results
Quantitative Results on DragBench
Our method achieves state-of-the-art performance across all metrics without requiring test-time optimization (TTO):
MD: Mean Distance (lower is better), SC: Semantic Consistency, PQ: Perceptual Quality, O: Overall (higher is better)
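As a worked example of the MD metric's arithmetic, the sketch below computes the mean Euclidean distance between user-specified target points and the tracked final positions of the dragged content. This is an illustrative reading of the metric only; DragBench's official evaluation (including how final positions are tracked) may differ.

```python
import numpy as np

def mean_distance(target_pts, final_pts):
    """Mean Euclidean distance between drag targets and where the dragged
    content actually ended up; lower means more faithful point following."""
    target = np.asarray(target_pts, dtype=float)
    final = np.asarray(final_pts, dtype=float)
    return float(np.linalg.norm(target - final, axis=1).mean())

# Two drag targets, each missed by a few pixels: MD = (2 + 3) / 2 = 2.5
print(mean_distance([(10, 20), (40, 40)], [(12, 20), (40, 43)]))
```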
Qualitative Results
Visual comparison with baseline methods on DragBench:
Qualitative results compared with baselines on DragBench; best viewed zoomed in.
Examples of DragBench Cases with Various Additional Text Prompts
LazyDrag can resolve ambiguous drag instructions through text guidance, enabling semantically meaningful edits:
Examples showing how text prompts can guide ambiguous drag instructions toward different semantic outcomes.
User Study Results
Expert evaluation by 20 participants on 32 cases from DragBench:
| Method | Preference (%) |
|---|---|
| DragNoise | 5.00 |
| DragDiffusion | 8.75 |
| FreeDrag | 8.75 |
| Other methods | ≤3.75 |
| LazyDrag (Ours) | 61.88 |
Comparison Between Drag and Move Mode
LazyDrag supports both drag and move modes, each with distinct characteristics:
Comparison between drag and move mode on DragBench.
Drag Mode
- Enables natural geometric transformations
- Supports 3D rotations and extensions
- Slightly degrades fine texture preservation
- Better for complex shape modifications
Move Mode
- Better preserves object identity
- Maintains fine details and textures
- Less suitable for rotation or extension
- Ideal for simple repositioning tasks
Citation
```bibtex
@article{yin2025lazydrag,
  title={LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence},
  author={Yin, Zixin and Dai, Xili and Wang, Duomin and Zeng, Xianfang and Ni, Lionel M. and Yu, Gang and Shum, Heung-Yeung},
  journal={arXiv preprint arXiv:2509.12203},
  year={2025}
}
```