Welcome! I am a research scientist on the Creative Vision team at Snap Research, working on Generative AI. I earned my Ph.D. in Computer Science from KAUST, where I was fortunate to be advised by Prof. Bernard Ghanem.
Prior to that, I received my B.Eng. degree from Xi'an Jiaotong University (XJTU), China, graduating with the university's highest undergraduate honor.
My primary research interests lie in computer vision and generative models.
I have authored 19 top-tier conference and journal papers, including one first-authored work with over 1,000 citations and three first-authored works with over 300 citations each.
In total, my publications have received over 3,000 citations, and my current h-index is 19.
My representative works include PointNeXt (NeurIPS, >1,000 citations, >900 GitHub stars), Magic123 (ICLR, >400 citations, >1.6K GitHub stars), and Omni-ID (CVPR'25, with products integrated into Snapchat).
I have also served as an Area Chair for ICLR since 2025.
If you are interested in working with me on image/video generative models, please reach out at guocheng.qian [at] outlook.com.

Ph.D. in CS
KAUST, 2019 - 2023

B.Eng in ME
XJTU, 2014 - 2018
Selected projects below; * / † denote equal contribution / corresponding author. See Full publication list.
This work isolates a specific attribute from any image and merges the selected attributes from multiple images into a coherent generation.
Canvas-to-Image introduces a unified framework that consolidates heterogeneous controls (subject references, bounding boxes, pose skeletons) into a single canvas interface for high-fidelity compositional image generation.
LayerComposer enables Photoshop-like control for multi-subject text-to-image generation, allowing users to naturally compose scenes by intuitively placing, resizing, and locking elements in a layered canvas with high fidelity.
We prevent shortcut learning in adapter training by explicitly providing the shortcuts during training, forcing the model to learn more robust representations.
ComposeMe is a human-centric generative model that enables disentangled control over multiple visual attributes — such as identity, hair, and garment — across multiple subjects, while also supporting text-based control.
ThinkDiff enables multimodal in-context reasoning in diffusion models by aligning vision-language models to LLM decoders, transferring reasoning capabilities without requiring complex reasoning-based datasets.
Omni-ID is a novel facial representation tailored for generative tasks, encoding identity features from unstructured images into a fixed-size representation that captures diverse expressions and poses.
WonderLand is a video-latent-based approach for single-image 3D reconstruction of large-scale scenes.
AC3D studies when and how camera signals should be conditioned into a video diffusion model for better camera control and higher video quality.
Magic123 is a coarse-to-fine image-to-3D pipeline that produces high-quality, high-resolution 3D content from a single unposed image, guided by both 2D and 3D priors.
Pix4Point shows that image pretraining significantly improves point cloud understanding.
PointNeXt boosts PointNet++ to state-of-the-art performance with improved training and scaling strategies.
ASSANet makes PointNet++ faster and more accurate.