
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

TalkVerse Website | 🖥️ GitHub | 🤗 Hugging Face | 📑 arXiv


We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation, designed to enable fair, reproducible comparison across methods. It contains 2.1M video clips (6.3K hours) curated from public T2V sources (OpenHumanVid, Panda70M). We also release all training and inference code.

🔥 Latest News

  • [2025-12-31] 🚀 We release the training and inference code for the TalkVerse-5B model, a lightweight baseline capable of minute-long generation.

🛠️ Installation

Install dependencies:

# Basic requirements for Wan2.2 backbone
pip install -r requirements.txt

# Additional requirements for Speech-to-Video (S2V) and Audio processing
pip install -r requirements_s2v.txt
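
To sanity-check the environment afterwards, a minimal snippet (assuming the requirements above pull in PyTorch with CUDA support):

# Hypothetical sanity check, not part of the repo; assumes PyTorch is installed.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())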

📥 Model Weights

Model | Description | Download
TalkVerse-5B | Audio-driven LoRA weights trained on TalkVerse | HuggingFace
Wan2.2-TI2V-5B | Base text/image-to-video model (backbone) | HuggingFace
Wav2Vec2 | Audio encoder (wav2vec2-large-xlsr-53-english) | HuggingFace

Download the models and place them in a ckpts/ directory (or adjust paths in scripts accordingly).
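
If you prefer to script the download, a minimal sketch using huggingface_hub; the repo IDs below are placeholders, so substitute the ones linked in the table above:

# Hypothetical download helper; the repo IDs are placeholders, see the links above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="<wan2.2-ti2v-5b-repo-id>", local_dir="ckpts/Wan2.2-TI2V-5B")
snapshot_download(repo_id="<talkverse-5b-lora-repo-id>", local_dir="ckpts")
snapshot_download(repo_id="<wav2vec2-repo-id>", local_dir="ckpts/wav2vec2-large-xlsr-53-english")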

🚀 Inference

We support both single-video generation and batch generation with the TalkVerse-5B (S2V) model.

1. Single Video Generation

Use generate.py to generate a single video from an image and audio file.

python generate.py \
    --task s2v-5B \
    --ckpt_dir ./ckpts/Wan2.2-TI2V-5B \
    --lora_ckpt ./ckpts/talkverse_5b_lora.safetensors \
    --image examples/input_face.jpg \
    --audio examples/input_audio.wav \
    --prompt "A person talking naturally." \
    --offload_model True \
    --t5_cpu

Arguments:

  • --task s2v-5B: Selects the TalkVerse 5B model.
  • --ckpt_dir: Path to the base Wan2.2-TI2V-5B checkpoint.
  • --lora_ckpt: (Optional) Path to the trained S2V LoRA checkpoint, if it has not been merged into the base model.
  • --offload_model True and --t5_cpu: Save VRAM on consumer GPUs (e.g., RTX 4090).
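
For a handful of clips, generate.py can also be driven from Python. The loop below is a hypothetical convenience wrapper (the examples/audio folder is a placeholder); for large datasets, use the batch script in the next section instead:

# Hypothetical loop over several audio clips with one reference image.
# Flags mirror the single-video example above.
import subprocess
from pathlib import Path

for audio in sorted(Path("examples/audio").glob("*.wav")):
    subprocess.run([
        "python", "generate.py",
        "--task", "s2v-5B",
        "--ckpt_dir", "./ckpts/Wan2.2-TI2V-5B",
        "--lora_ckpt", "./ckpts/talkverse_5b_lora.safetensors",
        "--image", "examples/input_face.jpg",
        "--audio", str(audio),
        "--prompt", "A person talking naturally.",
        "--offload_model", "True",
        "--t5_cpu",
    ], check=True)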

2. Batch Generation

To generate videos for a large dataset (e.g., a test set), use the provided shell script run_batch_generation.sh. The script automatically shards the input JSON file across the available GPUs.

  1. Prepare a batch config JSON file (e.g., batch_config.json) listing image/audio/prompt entries (a builder sketch follows this list):

    [
      {
        "image": "path/to/img1.jpg",
        "audio": "path/to/audio1.wav",
        "prompt": "A man talking..."
      },
      {
        "image": "path/to/img2.jpg",
        "audio": "path/to/audio2.wav",
        "prompt": "A woman singing..."
      }
    ]
  2. Run the script:

    # Edit the script to set CKPT_DIR and BATCH_FILE paths
    bash run_batch_generation.sh

    The script allows configuration of:

    • INFER_FRAMES=120 (number of frames in the generated video)
    • GUIDE_SCALE=6.5 (classifier-free guidance scale)
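
To assemble batch_config.json programmatically, here is a minimal sketch; the folder layout and filename pairing are assumptions, and create_batch_config.py in the repo may already cover your case:

# Hypothetical builder for batch_config.json; assumes images and audio files
# share a filename stem (clip_001.jpg <-> clip_001.wav). Adjust to your layout.
import json
from pathlib import Path

entries = []
for image in sorted(Path("data/images").glob("*.jpg")):
    audio = Path("data/audio") / f"{image.stem}.wav"
    if audio.exists():
        entries.append({
            "image": str(image),
            "audio": str(audio),
            "prompt": "A person talking naturally.",
        })

with open("batch_config.json", "w") as f:
    json.dump(entries, f, indent=2)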

🏋️ Training

We provide the training script train_s2v_5b.py to train the 5B model on the TalkVerse dataset (or your own data).

Data Preparation

The training script expects a VideoDataset that provides video, audio, and text pairs. Please refer to the Wan.src.dataset module or create_batch_config.py for data format details.
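
For orientation only, a toy illustration of a dataset yielding such triplets; this is a hypothetical sketch, not the repo's actual schema, which is defined in Wan.src.dataset:

# Hypothetical dataset sketch: each item pairs a video clip with its audio
# track and caption. The real format is defined in Wan.src.dataset.
from torch.utils.data import Dataset

class ToyTalkingVideoDataset(Dataset):
    def __init__(self, samples):
        # samples: list of dicts with "video", "audio", and "prompt" fields
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        return {
            "video": s["video"],    # path to the video clip
            "audio": s["audio"],    # path to the wav file
            "prompt": s["prompt"],  # text caption
        }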

Running Training

You can launch distributed (DDP) training with torchrun.

# Example: Train on 8 GPUs with DDP
torchrun --nproc_per_node=8 train_s2v_5b.py \
    --distributed_policy ddp \
    --ckpt_dir ./ckpts/Wan2.2-TI2V-5B \
    --output_path ./output_s2v_5b \
    --batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --unfreeze_strategy lora \
    --enable_roi_loss true \
    --enable_framepack true
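
With these defaults the effective batch size is 8 GPUs × batch_size 1 × 4 gradient-accumulation steps = 32 samples per optimizer step; adjust --gradient_accumulation_steps accordingly if you train on fewer GPUs.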

Key Arguments:

  • --unfreeze_strategy: Controls which parts of the model to train.
    • minimal: Unfreezes only time embeddings and new headers.
    • framepack: Unfreezes frame packing modules.
    • lora: Uses LoRA for efficient fine-tuning.
    • full: Full parameter fine-tuning.
  • --enable_roi_loss: Applies higher loss weight to face/body regions (requires pre-computed masks).
  • --enable_framepack: Enables context frame packing for temporal consistency in long-video generation.
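
As a generic illustration (not the repo's implementation), the lora strategy typically amounts to freezing the backbone and leaving only adapter parameters trainable:

# Generic sketch of a LoRA-style unfreeze strategy; the actual logic lives
# in train_s2v_5b.py and may differ.
import torch.nn as nn

def apply_lora_unfreeze(model: nn.Module) -> None:
    for name, param in model.named_parameters():
        # Train only parameters whose names mark them as LoRA adapters.
        param.requires_grad = "lora" in name.lower()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable / 1e6:.1f}M")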

🎥 Gradio Demo

We provide a Gradio interface for easy interaction with the model. The demo requires two backend APIs to be running: the Qwen API (for prompt enhancement) and the Wan API (for video generation).

  1. Start the Qwen API (Terminal 1):

    # Runs on GPU 0
    python gradio/qwen_api.py
  2. Start the Wan API (Terminal 2):

    # Runs on GPU 1
    export CUDA_VISIBLE_DEVICES='1'
    python gradio/wan_api_trim.py
  3. Launch the Gradio App (Terminal 3):

    # Connects to APIs
    python gradio/gradio_demo.py

    The app allows you to:

    • Upload an image and audio file.
    • Enhance the input prompt using Qwen (optional).
    • Generate a talking face video.
    • View the result directly in the browser.
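
If juggling three terminals is inconvenient, the same steps can be scripted; this is a hypothetical launcher mirroring the commands above, and the fixed sleep is a crude stand-in for real health checks:

# Hypothetical single-script launcher for the demo; mirrors the three steps above.
import os
import subprocess
import time

qwen = subprocess.Popen(["python", "gradio/qwen_api.py"],
                        env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"})
wan = subprocess.Popen(["python", "gradio/wan_api_trim.py"],
                       env={**os.environ, "CUDA_VISIBLE_DEVICES": "1"})
time.sleep(30)  # crude wait for both APIs to come up; adjust as needed
subprocess.run(["python", "gradio/gradio_demo.py"])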

⚖️ License

The code, dataset, and model weights are released under the Snap Inc. Non-Commercial License.

📚 Citation

If you find TalkVerse useful for your research, please cite:

@article{wang2025talkverse,
  title={TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation},
  author={Wang, Zhenzhi and Wang, Jian and Ma, Ke and Lin, Dahua and Zhou, Bing},
  journal={arXiv preprint arXiv:2512.14938},
  year={2025}
}
