VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Hyeonho Jeong^* Geon Yeong Park^* Jong Chul Ye
^*Indicates Equal Contribution
BISPL, Korea Advanced Institute of Science and Technology

CVPR 2024


Video Motion Customization refers to the task of adapting pre-trained video generative models to create videos that feature a particular motion across various visual contexts and scenes. The goal is to retain the original motion patterns from an input video while presenting them in diverse visual settings. For example, given a video depicting sharks swimming, VMC aims to generate videos that follow the same motion as the sharks but in entirely distinct scenarios, such as airplanes flying in the sky or spaceships navigating in space.

Input Video of Motion M	Goldfish + M + In the ocean	Iron Sharks + M + In the sky
	Airplanes + M + In the sky	Spaceships + M + In space

Abstract

Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. Specifically, they encounter hurdles in (1) accurately reproducing motion from a target video, and (2) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this, we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive noisy latent frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.

Overview

The proposed Video Motion Customization (VMC) framework distills the motion trajectories from the residual between consecutive (latent) frames, namely the motion vector $\delta v_t^n$ for $t \geq 0$. Specifically, we fine-tune only the temporal attention layers of the key-frame generation model by aligning the ground-truth and predicted motion vectors. After training, the customized key-frame generator is leveraged for target motion-driven video generation in new appearance contexts, e.g., "A chicken is walking in a city".
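The idea of a motion vector as a frame residual, and of aligning ground-truth and predicted motion vectors, can be illustrated with a small sketch. This is not the authors' implementation: each latent frame is assumed to be flattened into a plain list of floats, the function names are hypothetical, and the cosine-based loss is only one plausible way to measure alignment, since this page does not state the exact form of the objective.

```python
import math


def motion_vectors(frames):
    """Residuals between consecutive frames: frames[n + 1] - frames[n]."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two equal-length vectors (eps avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def motion_alignment_loss(target_frames, predicted_frames):
    """Mean of (1 - cosine similarity) over paired motion vectors.

    Assumed loss form, chosen for illustration only.
    """
    target = motion_vectors(target_frames)
    predicted = motion_vectors(predicted_frames)
    losses = [1.0 - cosine_similarity(t, p) for t, p in zip(target, predicted)]
    return sum(losses) / len(losses)


# Toy "videos": three frames, each a 2-dimensional latent.
target = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
same_motion = [[5.0, 5.0], [6.0, 5.0], [7.0, 5.0]]      # different content, same motion
opposite_motion = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # same content, reversed motion

loss_same = motion_alignment_loss(target, same_motion)          # close to 0
loss_opposite = motion_alignment_loss(target, opposite_motion)  # close to 2
print(loss_same, loss_opposite)
```

In this toy setting the loss depends only on the frame-to-frame residuals, so a sequence with different frame values but the same motion scores near zero. That is the property that lets motion be separated from appearance.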

Video Motion Customization Results

From Low-resolution, Short Input Video to High-resolution, Long Output Video

Motion of Cars

Input Video of Motion M	Tank + M + On the snow	Police car + M + In a town
Lamborghini + M + In desert	Lamborghini + M + In space	Car + M + Underwater

Input Video of Motion M	Tank + M + On the road	Car + M + In desert

Input Video of Motion M	Tank + M + On the road	Horse + M + On the road
	Car + M + On the ice	Bus + M + On the ice

Input Video of Motion M	Elephant + M + In Africa	Bus + M + In a town	Ferrari + M

Motion of Airplanes

Input Video of Motion M	Spaceship + M + In space	Satellite + M	Shark + M + Under the sea

Input Video of Motion M	Spaceship + M	Satellite + M	Shark + M + In the ocean

Motion of Birds (flying)

Input Video of Motion M	Phoenix + M + Above lava	Majestic dragon + M + In forest

Motion of Birds (taking off)

Input Video of Motion M	Owl + M + From a cliff	Eagle + M + On the stone	Seagull + M + On the sand

Motion of Birds (walking)

Input Video of Motion M	Chicken + M + On the road	Chicken + M + In a city	Seagull + M + Underwater

Input Video of Motion M	Eagle + M + On edge	Duck + M + Around a pond	Owl + M + In the forest

Motion of Birds (floating)

Input Video of Motion M	Puppy + M + On the water	Turtle + M + On water
	Boat + M + On water	Balloon + M + On water

Motion of Human

Input Video of Motion M	Spider-man + M	Astronaut + M + underwater in deep sea

Input Video of Motion M	M + In the desert	M + On the ice	M + motorbike	Monkey + M

Motion of Plants

Input Video of Motion M	Rose + M	Starfish + M	Starfish + M + Chinese watercolor style

Motion of Diffusion

Input Video of Motion M	Ink + M + In water	Jellyfish + M	Flower + M

Motion of Fall

Input Video of Motion M	Gem stones + M	Colorful confetti + M	Bubbles + M
Feathers + M	Stars + M	Snow + M	Stones + M + Underwater

Motion of Space

Input Video of Motion M	M + In deep water	Stars + M + In sky

Motion of Huge animals

Input Video of Motion M	Tiger + M	Bulldog + M + Around flowers

Input Video of Motion M	Origami dinosaur + M	Pigeon + M + In the style of oil art

Input Video of Motion M	Bear + M + In a bamboo grove	Bear + M + On the pond	Lion + M + On the pond

Input Video of Motion M	Panda + M	Panda + M + On the snow	Tiger + M + On the snow

Motion of Small animals

Input Video of Motion M	Dog + M + On the grass	White fox + M + on the beach	Wolf + M + in the flowers

Input Video of Motion M	Puppy + M	Tiger + M	Tiger + M + On the grass	Wolf + M + On the snow

Input Video of Motion M	Dog + M + Watermelon	Fox + M + Watermelon	Rabbit + M + Watermelon + On the grass	Rabbit + M + Watermelon + On the sand
Squirrel + M + Watermelon	Squirrel + M + Orange	Squirrel + M + Watermelon + On the grass	Squirrel + M + Watermelon + On the Sand (Train M for 300 Steps)	Squirrel + M + Watermelon + On the Sand (Train M for 400 Steps)

Comparisons to Baselines

VMC is compared against four state-of-the-art baselines:
⇨ VideoComposer, Gen-1, Tune A Video, Control A Video

A car is moving. ⇨ A car is moving, underwater.

Input Video	VideoComposer	Gen-1
Ours	Tune A Video	Control A Video

Two sharks are moving. ⇨ Two airplanes are moving in the sky.

Input Video	VideoComposer	Gen-1
Ours	Tune A Video	Control A Video

An owl is taking off. ⇨ A seagull is taking off on the sand.

Input Video	VideoComposer	Gen-1
Ours	Tune A Video	Control A Video

Ink is spreading. ⇨ Flower is spreading.

Input Video	VideoComposer	Gen-1
Ours	Tune A Video	Control A Video

Video Style Transfer

Input Video	Starry Night by Vincent Van Gogh.	Input Video	Oil painting of flowers.
Input Video	Classic anime from 1990.	Input Video	Starry Night by Vincent Van Gogh.

Backward Motion Customization

We customize a video diffusion model to learn extremely improbable motions:
backward motions from reversed real-world videos.
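A minimal sketch of what reversing a video does to its motion, assuming the same simplified view as elsewhere on this page, where motion is the residual between consecutive frames. The helper names and the flattened list-of-floats frames are illustrative assumptions, not part of the released code. Playing the frames backward negates every residual and reverses their order, which is what the notation M^-1 refers to here.

```python
def motion_vectors(frames):
    """Residuals between consecutive frames: frames[n + 1] - frames[n]."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def reverse_video(frames):
    """Return the frames in reverse temporal order."""
    return frames[::-1]


# Toy "video": three frames, each a 2-dimensional latent.
forward = [[0.0, 0.0], [1.0, 0.5], [3.0, 0.5]]
backward = reverse_video(forward)

forward_motion = motion_vectors(forward)    # motion M
backward_motion = motion_vectors(backward)  # motion M^-1

# The reversed video's residuals are the forward residuals, negated and in reverse order.
negated_reversed = [[-v for v in vec] for vec in forward_motion[::-1]]
print(backward_motion == negated_reversed)
```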

Real-world Video with Motion M	Reversed Video with Motion M^-1	Ink + M^-1 + In water	Jellyfish + M^-1
Real-world Video with Motion M	Reversed Video with Motion M^-1	Car + M^-1 + In desert	Tank + M^-1 + On the road
Real-world Video with Motion M	Reversed Video with Motion M^-1	Lamborghini + M^-1 + In space	Car + M^-1 + Underwater
Real-world Video with Motion M	Reversed Video with Motion M^-1	Eagle + M^-1 + On edge	Duck + M^-1 + Around a pond

BibTeX

        @article{jeong2023vmc,
                title={VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models},
                author={Jeong, Hyeonho and Park, Geon Yeong and Ye, Jong Chul},
                journal={arXiv preprint arXiv:2312.00845},
                year={2023},
        }