My research focuses on building structured and controllable generative models that enable interaction and compositionality across modalities, including text, images, video, and 3D. I am particularly interested in developing human-centric multimodal systems that can perceive, generate, and interact with complex environments.
In my work, I explore unified structured representations as a foundation for intrinsic control and interaction in generative systems, going beyond prompt-based generation.
My long-term goal is to build interactive world models that support co-creation between humans and AI agents, enabling the modeling of dynamic environments, human behavior, and long-horizon interaction.
Tao Hu, Varun Jampani.
under review, 2026
→ A model-agnostic diffusion framework for controllable, motion-consistent video generation using spatiotemporally consistent noise sampling and joint appearance-motion modeling.
Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani.
Technical Report, 2025
→ A generalizable framework that leverages generic video diffusion priors for controllable 4D human generation from a single image, enabling pose and camera control.
Tao Hu, Fangzhou Hong, Zhaoxi Chen, Ziwei Liu.
arXiv:2404.01655, under review [Project Page] [Video] [arXiv] → The first work to build an interactive 3D human generation and editing system with multimodal controls (e.g., text, images, hand-drawn sketches) in a unified framework.
Tao Hu, Fangzhou Hong, Ziwei Liu.
European Conference on Computer Vision (ECCV 2024)
[Project Page] [Video] [Code] [arXiv] [Media Coverage]
[Media Coverage in Chinese: 1,2]
→ A new paradigm for 3D human generation from 2D image collections, with three key designs: a structured 2D latent space, a structured auto-decoder, and a structured latent diffusion model.
Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Weiye Xiao, Haiyi Mei, Lei Yang, Ziwei Liu.
International Journal of Computer Vision (IJCV 2025) [Paper] [Project Page] [Code] → A diffusion-based approach for layer-wise controllable 3D human generation.
Tao Hu, Fangzhou Hong, Ziwei Liu.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2024)
[Paper]
[Project Page]
[Video]
[Code]
[Media Coverage in Chinese: Media Heart, SenseTime Research]
→ A new paradigm for learning dynamic human rendering from videos that jointly models temporal motion dynamics and human appearance in a unified framework based on a novel surface-based triplane.
Tao Hu, Hongyi Xu, Linjie Luo, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, Matthias Zwicker.
IEEE Transactions on Visualization and Computer Graphics (TVCG 2023)
[Paper]
[Project Page]
[Video]
[Code]
→ A virtual teleportation system using sparse-view cameras, based on a novel texel-aligned multimodal representation.
Tao Hu, Tao Yu, Zerong Zheng, He Zhang, Yebin Liu, Matthias Zwicker.
International Conference on 3D Vision (3DV 2022)
[Paper]
[Project Page] [Video] [Poster] [arXiv]
[Code] → The first work that combines classical volumetric rendering with probabilistic generative models for efficient and realistic dynamic human rendering.
Tao Hu, Geng Lin, Zhizhong Han, Matthias Zwicker.
IEEE Winter Conference on Applications of Computer Vision (WACV 2021) [Paper] [Code] [arXiv] → Extends the multi-view representation to generalizable geometry and texture reconstruction from single RGB images.
Tao Hu, Zhizhong Han, Matthias Zwicker.
AAAI Conference on Artificial Intelligence (AAAI 2020, Oral, top 10% among accepted papers in 3D vision track)
[Paper] [Code] [arXiv] → Introduces a self-supervised multi-view consistent inference technique that enforces geometric consistency in the multi-view representation.
Tao Hu, Zhizhong Han, Abhinav Shrivastava, Matthias Zwicker.
IEEE ICCV Geometry Meets Deep Learning Workshop (ICCVW 2019, Oral) [Paper] [Code] [arXiv] → Presents a multi-view-based 3D shape representation with a multi-view completion network for dense 3D shape completion.
Tao Hu, Gangyi Ding, Lijie Li, Longfei Zhang.
Highlights of Sciencepaper, Chinese Journal, May 2016. → Proposes a parallel video player plugin for CryEngine3 that raises the frame rate from 16 FPS to 54 FPS on a large-scale virtual stage with 40 LED screens playing videos simultaneously for digital performance.
Conference Reviewer:
Conference on Computer Vision and Pattern Recognition (CVPR)
International Conference on Computer Vision (ICCV)
European Conference on Computer Vision (ECCV)
International Conference on 3D Vision (3DV)
Winter Conference on Applications of Computer Vision (WACV)
Conference on Neural Information Processing Systems (NeurIPS)
International Conference on Learning Representations (ICLR)
International Conference on Machine Learning (ICML)
Asian Conference on Computer Vision (ACCV)
International Conference on Pattern Recognition (ICPR)
IEEE Conference on Virtual Reality (VR)
Journal Reviewer:
Computer Graphics Forum
Computer Vision and Image Understanding
Image and Vision Computing
Pattern Recognition Letters
Selected Awards & Honors
Graduate National Scholarship (Top 2%), Ministry of Education of China, 2016
Undergraduate National Scholarship (Top 2%), Ministry of Education of China, 2014
Teaching Experience
Teaching Assistant, Dept. of Computer Science, UMD.
CMSC425 Game Programming (Prof. Roger Eastman), Fall 2019
CMSC425 Game Programming (Prof. Roger Eastman), Spring 2019
CMSC216 Introduction to Computer Systems (Mr. Laurence Herman), Fall 2018