Zixin Yin (殷子欣)

Graduating Spring 2027. Open to full-time roles.

Since 2021, I have been a PhD student at HKUST co-advised by Lionel Ni (President of HKUST-GZ) and Harry Shum (Former Executive Vice President of Microsoft). My research focuses on image and video generation, visual editing, and talking head synthesis.

From May 2026, I will join Meta as a research intern, working on video generation with Lu Yuan. Previously, from May 2025 to April 2026, I interned at IDEA and StepFun, collaborating with Lei Zhang and Gang Yu, and co-authored nine papers on image/video generation and editing. Concurrently, I led R&D teams working on video generation and agents at Xiaobing.ai.

From April 2023 to April 2025, I co-founded Morph Studio, a video generation startup with over 1.5 million users. Earlier, starting in 2022, I worked as a research intern at Xiaobing.ai, collaborating with Baoyuan Wang and Duomin Wang.

From August 2019 to May 2021, I worked with Carlo H. Séquin at UC Berkeley on the graphics project JIPCAD.

I received my B.S. from the Department of Computer Science and Technology (Honors Science Program) and completed the Honors Youth Program (少年班) at Xi'an Jiaotong University in 2021.

Publications

ReasonEdit: Towards Reasoning-Enhanced Image Editing Models

Step1X-Image Team (Core Member)

2026 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2026

We show that unlocking MLLM reasoning via a thinking–editing–reflection loop significantly improves image editing accuracy by enhancing instruction understanding and automatic error correction, achieving consistent gains over state-of-the-art diffusion-based editors such as Step1X-Edit and Qwen-Image-Edit.

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum

The Fourteenth International Conference on Learning Representations, ICLR 2026

Completed from scratch in 30 days

We introduce ColorCtrl, a training-free method for text-guided color editing in images and videos. It enables precise, word-level control of color attributes while preserving geometry and material consistency. Experiments on SD3, FLUX.1-dev, and CogVideoX show that ColorCtrl outperforms existing training-free and commercial models, including GPT-4o and FLUX.1 Kontext Max, and generalizes well to instruction-based editing frameworks.

LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum

The Fourteenth International Conference on Learning Representations, ICLR 2026

Completed from scratch in 30 days

We introduce LazyDrag, a drag-based image editing method for Multi-Modal Diffusion Transformers that replaces implicit point matching with explicit correspondence, enabling precise geometric control and text guidance without test-time optimization.

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li

The Fourteenth International Conference on Learning Representations, ICLR 2026

We introduce SpeakerVid-5M, the first large-scale dataset designed specifically for the audio-visual dyadic interactive virtual human task.

ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Zixin Yin, Ling-Hao Chen, Lionel M. Ni, Xili Dai

ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia, ACM SIGGRAPH Asia 2025

Oral Presentation (Completed from scratch in 21 days)

ConsistEdit is a training-free attention control method for MM-DiT that enables precise, structure-aware image and video editing. It supports multi-round edits with strong consistency and achieves state-of-the-art performance without manual design or test-time tuning.

Motion2Motion: Cross-topology Motion Retargeting with Sparse Correspondence

Ling-Hao Chen, Yuhong Zhang, Zixin Yin, Zhiyang Dou, Xin Chen, Jingbo Wang, Taku Komura, Lei Zhang

ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia, ACM SIGGRAPH Asia 2025

Oral Presentation

This work addresses animation retargeting across characters with differing skeletal topologies. Motion2Motion is a novel, training-free framework that requires only a few target motions and sparse bone correspondences, overcoming topological inconsistencies without large datasets. Extensive evaluations show strong performance in both similar and cross-species settings, with practical applications in user-facing tools. Code and data will be released.

UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, Gang Yu

arXiv, 2509.06155

We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To improve training efficiency, we avoid training from scratch and instead employ a stitching-of-experts technique.

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

Zhentao Yu*, Zixin Yin*, Deyu Zhou*, Duomin Wang, Finn Wong, Baoyuan Wang

2023 IEEE International Conference on Computer Vision, ICCV 2023

We introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead probabilistically sample all the holistic lip-irrelevant facial motions (i.e., pose, expression, blink, gaze, etc.) to semantically match the input audio, while still maintaining both photo-realistic audio-lip synchronization and overall naturalness.

Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, Baoyuan Wang

2023 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2023

We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze & blink, head pose, and emotional expression. We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them.