Video Diffusion Alignment via Reward Gradient

Anonymous Authors

Abstract

Significant progress has been made towards building foundational video diffusion models. Because these models are trained on large-scale unsupervised data, it has become crucial to adapt them to specific downstream tasks, such as video-text alignment or ethical video generation. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we instead utilize pre-trained reward models that are learned via preferences on top of powerful discriminative models. These reward models provide dense gradient information with respect to the generated RGB pixels, which is critical for learning efficiently in complex search spaces such as videos. We show that our approach can align video diffusion models for aesthetic generation, for similarity between the text context and the video, and for long-horizon video generation that is 3x longer than the training sequence length. We also show that our approach learns much more efficiently, in terms of both reward queries and compute, than previous gradient-free approaches to video generation.
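The core idea of using reward gradients with respect to pixels can be illustrated with a toy sketch. This is not the authors' implementation: the two-parameter "generator", the quadratic reward, and all function names below are hypothetical stand-ins chosen so that the chain rule can be written out by hand. In a real setup the generator would be a video diffusion sampler, the reward would be a learned model (for example an aesthetic or HPS scorer), and the gradients would come from automatic differentiation.

```python
import random

def generate(theta, noise):
    """Toy stand-in for a generator: maps noise to pixel values."""
    scale, shift = theta
    return [scale * z + shift for z in noise]

def reward(pixels, target=0.8):
    """Toy differentiable reward: highest when every pixel equals `target`."""
    return -sum((p - target) ** 2 for p in pixels) / len(pixels)

def reward_grad_wrt_pixels(pixels, target=0.8):
    """Dense gradient of the reward with respect to each pixel."""
    n = len(pixels)
    return [-2.0 * (p - target) / n for p in pixels]

def reward_grad_wrt_theta(theta, noise):
    """Chain rule: push the per-pixel reward gradient back to the parameters."""
    pixels = generate(theta, noise)
    g_pix = reward_grad_wrt_pixels(pixels)
    g_scale = sum(g * z for g, z in zip(g_pix, noise))  # d pixel / d scale = z
    g_shift = sum(g_pix)                                # d pixel / d shift = 1
    return [g_scale, g_shift]

def finetune(theta, steps=200, lr=0.1, n_pixels=16, seed=0):
    """Gradient ascent on the reward, sampling fresh noise at every step."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]
        grad = reward_grad_wrt_theta(theta, noise)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

if __name__ == "__main__":
    theta0 = [1.0, 0.0]
    eval_rng = random.Random(123)
    eval_noise = [eval_rng.gauss(0.0, 1.0) for _ in range(16)]
    theta1 = finetune(theta0)
    print("reward before:", reward(generate(theta0, eval_noise)))
    print("reward after: ", reward(generate(theta1, eval_noise)))
```

Each update here uses one reward evaluation plus its per-pixel gradient, so every pixel contributes a learning signal. A gradient-free method would observe only the scalar reward for each sample, which is the efficiency gap the abstract refers to.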

Aesthetic Reward, in-distribution prompts

We compare VADER against its base model, ModelScope, and against prior alignment approaches, including DDPO and DiffusionDPO, using the aesthetic reward on in-distribution prompts.

[Video comparison for each prompt, in order: VADER (Ours), ModelScope, DDPO, DPO]
"A man smiles as he stirs his food in the pot"
"A woman in a purple top pulling food out of a oven"
"At night on a street with a group of a bicycle riders riding down the road together"

HPS Reward, in-distribution prompts

We compare VADER against its base model, ModelScope, and against prior alignment approaches, including DDPO and DiffusionDPO, using the HPS (Human Preference Score) reward on in-distribution prompts.

[Video comparison for each prompt, in order: VADER (Ours), ModelScope, DDPO, DPO]
"some people holding umbrellas and standing by a car in the rain"
"A woman eating fresh vegetables from a bowl"
"A man getting food ready while people watch."

Aesthetic Reward, OOD prompts

We compare VADER against its base model, ModelScope, and against prior alignment approaches, including DDPO and DiffusionDPO, using the aesthetic reward on out-of-distribution (OOD) prompts.

[Video comparison for each prompt, in order: VADER (Ours), ModelScope, DDPO, DPO]
"a dolphin riding a bike"
"a wolf washing the dishes"
"a sheep playing chess"

HPS Reward, OOD prompts

We compare VADER against its base model, ModelScope, and against prior alignment approaches, including DDPO and DiffusionDPO, using the HPS reward on OOD prompts.

[Video comparison for each prompt, in order: VADER (Ours), ModelScope, DDPO, DPO]
"a chicken riding a bike"
"a monkey washing the dishes"
"a deer playing chess"
