Zixin Yin (殷子欣)

Email: zixin.yin[at]connect.ust.hk    Google Scholar    GitHub

I am a PhD student at The Hong Kong University of Science and Technology (since 2021), supervised by Prof. Lionel Ni (President of HKUST(GZ)) and Prof. Harry Shum (former Executive Vice President of Microsoft).

From April 2023 to April 2025, I was a co-founder of Morph Studio. Before that, starting in 2022, I was a research intern at Xiaobing.ai, where I worked closely with Baoyuan Wang and Duomin Wang. My research interests include image and video generation, visual editing, and talking head synthesis.

From August 2019 to May 2021, I worked with Prof. Carlo H. Séquin at UC Berkeley on the graphics and CAD project JIPCAD.

I received my B.S. from the Department of Computer Science and Technology (Honors Science Program) and completed the Honors Youth Program (少年班) at Xi'an Jiaotong University in 2021.

Publications
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
arXiv:2508.09131, 2025
[PDF] [Project] [Code (coming soon)] [BibTeX]

We introduce ColorCtrl, a training-free method for text-guided color editing in images and videos. It enables precise, word-level control of color attributes while preserving geometry and material consistency. Experiments on SD3, FLUX.1-dev, and CogVideoX show that ColorCtrl outperforms existing training-free methods as well as commercial models such as GPT-4o and FLUX.1 Kontext Max, and generalizes well to instruction-based editing frameworks.

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
arXiv:2507.09862, 2025
[PDF] [Project] [Code] [Dataset]

We introduce SpeakerVid-5M, the first large-scale dataset designed specifically for the audio-visual dyadic interactive virtual human task.

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
Zhentao Yu*, Zixin Yin*, Deyu Zhou*, Duomin Wang, Finn Wong, Baoyuan Wang
IEEE/CVF International Conference on Computer Vision (ICCV), 2023
[PDF] [Project] [Code (coming soon)] [BibTeX]

We introduce a simple yet novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we probabilistically sample all the holistic lip-irrelevant facial motions (e.g., pose, expression, blink, gaze) to semantically match the input audio, while still maintaining photo-realistic audio-lip synchronization and overall naturalness.

Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis
Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, Baoyuan Wang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[PDF] [Project] [Code] [BibTeX]

We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze and blink, head pose, and emotional expression. We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them.

(* denotes equal contribution)

The website template was adapted from Duomin Wang.