FlowAct-R1: Towards Interactive Humanoid Video Generation
Overview of FlowAct-R1. The framework comprises a training stage and an inference stage. Training converts a base full-attention DiT into a streaming autoregressive model via autoregressive adaptation, applies joint audio-motion finetuning for better lip-sync and body motion, and performs multi-stage diffusion distillation. Inference adopts a structured memory bank (Reference, Long-term, and Short-term Memory, plus a Denoising Stream) with chunkwise autoregressive generation and memory refinement. Complemented by system-level optimizations, FlowAct-R1 achieves real-time 480p video generation at 25 fps with a time-to-first-frame (TTFF) of about 1.5 s and vivid behavioral transitions.
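To make the inference scheme concrete, below is a minimal Python sketch of chunkwise autoregressive generation driven by a structured memory bank. It is written under our own assumptions: the MemoryBank class, the model.denoise_chunk method, and all parameter names are hypothetical illustrations of the described design, not the FlowAct-R1 API.

# Minimal sketch of chunkwise autoregressive inference with a structured
# memory bank, per the overview above. All names here (MemoryBank,
# denoise_chunk, refine, ...) are hypothetical, not the released API.
from collections import deque

import torch


class MemoryBank:
    """Holds the three memory streams that condition each denoising step."""

    def __init__(self, reference: torch.Tensor, long_term_size: int = 16,
                 short_term_size: int = 4):
        self.reference = reference                       # Reference Memory: fixed identity frames
        self.long_term = deque(maxlen=long_term_size)    # Long-term Memory: sparse history
        self.short_term = deque(maxlen=short_term_size)  # Short-term Memory: most recent chunks

    def context(self) -> torch.Tensor:
        """Concatenate all memory streams into one conditioning sequence."""
        frames = [self.reference, *self.long_term, *self.short_term]
        return torch.cat(frames, dim=0)

    def refine(self, chunk: torch.Tensor) -> None:
        """Memory refinement: push the new chunk into short-term memory and
        promote one frame per chunk into long-term memory as sparse history."""
        self.short_term.append(chunk)
        self.long_term.append(chunk[-1:])


@torch.no_grad()
def generate_stream(model, bank: MemoryBank, audio_chunks, chunk_frames: int = 4):
    """Chunkwise autoregressive generation: each chunk is denoised conditioned
    on the memory bank, then fed back through memory refinement."""
    for audio in audio_chunks:  # streaming audio input, one chunk at a time
        noise = torch.randn(chunk_frames, *bank.reference.shape[1:])
        chunk = model.denoise_chunk(noise, audio, bank.context())  # few-step distilled denoiser
        bank.refine(chunk)      # update memories before generating the next chunk
        yield chunk             # emit frames immediately, keeping TTFF low

In this sketch, short-term memory preserves the most recent chunks for temporal continuity, long-term memory retains a sparse per-chunk summary of earlier content, and frames are yielded chunk by chunk rather than after the full sequence, which is what enables streaming playback.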
FlowAct-R1 synthesizes lifelike humanoid videos in a streaming fashion, producing naturally expressive behaviors over unbounded durations for truly seamless interaction.
FlowAct-R1 is highly responsive, showing strong potential for real-time, low-latency instant-communication scenarios.
Our method is robust across a wide range of character and motion styles.
FlowAct-R1 outperforms state-of-the-art methods in human preference evaluation by simultaneously achieving real-time streaming, infinite-duration generation, and superior behavioral naturalness.
The orange segments indicate the percentage of user votes favoring FlowAct-R1 over each competing method.
@article{wang2026flowact,
  title={FlowAct-R1: Towards Interactive Humanoid Video Generation},
  author={Wang, Lizhen and Zhu, Yongming and Ge, Zhipeng and Zheng, Youwei and Zhang, Longhao and Hu, Tianshu and Qin, Shiyang and Luo, Mingshuang and Zhang, Jiaxu and Chen, Xin and others},
  journal={arXiv preprint arXiv:2601.10103},
  year={2026}
}

@article{zhu2024infp,
  title={INFP: Audio-driven interactive head generation in dyadic conversations},
  author={Zhu, Yongming and Zhang, Longhao and Rong, Zhengkun and Hu, Tianshu and Liang, Shuang and Ge, Zhipeng},
  journal={arXiv preprint arXiv:2412.04037},
  year={2024}
}