From April 2023 to April 2025, I served as the co-founder of Morph Studio.
Prior to that, I worked closely with
Baoyuan Wang
and
Duomin Wang
as a research intern at
Xiaobing.ai starting in 2022.
My research interests include image and video generation, visual editing, and talking head synthesis.
From August 2019 to May 2021, I worked with
Prof. Carlo H. Séquin
at UC Berkeley on the graphics and CAD project
JIPCAD.
I received my B.S. from the Department of Computer Science and Technology
(Honors Science Program) and completed the Honors Youth Program (少年班) at
Xi'an Jiaotong University in 2021.
Publications
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
arXiv, 2508.09131,
[PDF][Project][Code(coming soon)][BibTeX]
We introduce ColorCtrl, a training-free method for text-guided color editing in images and videos. It enables precise, word-level control of color attributes while preserving geometry and material consistency. Experiments on SD3, FLUX.1-dev, and CogVideoX show that ColorCtrl outperforms existing training-free and commercial models, including GPT-4o and FLUX.1 Kontext Max, and generalizes well to instruction-based editing frameworks.
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
arXiv, 2507.09862,
[PDF][Project][Code][Dataset]
We introduce SpeakerVid-5M, the first large-scale dataset designed specifically for the audio-visual dyadic interactive virtual human task.
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
Zhentao Yu*, Zixin Yin*, Deyu Zhou*, Duomin Wang, Finn Wong, Baoyuan Wang
2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023,
[PDF][Project][Code(coming soon)][BibTeX]
We introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for deterministic, controlled synthesis, we probabilistically sample all the holistic lip-irrelevant facial motions (e.g., pose, expression, blink, and gaze) to semantically match the input audio, while still maintaining accurate audio-lip synchronization and overall photo-realism.
Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis
Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, Baoyuan Wang
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023,
[PDF][Project][Code][BibTeX]
We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze and blink, head pose, and emotional expression.
We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them.
(* denotes equal contribution)
The website template was adapted from Duomin Wang.