VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models


Hyeonho Jeong* Geon Yeong Park* Jong Chul Ye
*Indicates Equal Contribution

BISPL, Korea Advanced Institute of Science and Technology
CVPR 2024

Video Motion Customization refers to the task of adapting pre-trained video generative models to create videos that feature a particular motion across various visual contexts and scenes. The goal is to retain the original motion patterns from an input video while presenting them in diverse visual settings. For example, given a video depicting sharks swimming, VMC aims to generate videos that follow the same motion as the sharks but in entirely distinct scenarios, such as airplanes flying in the sky or spaceships navigating in space.


Input Video of Motion M
Goldfish + M + In the ocean
Iron Sharks + M + In the sky

Airplanes + M + In the sky
Spaceships + M + In space

Abstract

Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions remains a substantial challenge. Specifically, they struggle to (1) accurately reproduce motion from a target video and (2) create diverse visual variations. For example, straightforward extensions of static image customization methods to video often entangle appearance and motion data. To tackle this, we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach that adapts the temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective that uses residual vectors between consecutive noisy latent frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency, motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.



Overview



The proposed Video Motion Customization (VMC) framework distills motion trajectories from the residual between consecutive (latent) frames, namely the motion vector $\delta v_t^n$ for $t \geq 0$. Specifically, we fine-tune only the temporal attention layers of the key-frame generation model by aligning the ground-truth and predicted motion vectors. After training, the customized key-frame generator is used for target motion-driven video generation with new appearance contexts, e.g., "A chicken is walking in a city".
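
As a rough illustration of the idea above, the sketch below computes frame residuals (motion vectors) from a short sequence of latent frames and scores how well predicted residuals align with ground-truth ones. It assumes each latent frame is flattened into a plain list of floats and uses one minus cosine similarity as the alignment measure. The function names `frame_residuals` and `motion_alignment_loss`, the toy data, and the choice of cosine similarity are illustrative assumptions, not the authors' implementation.

```python
import math


def frame_residuals(frames):
    """Return residuals between consecutive frames: frames[n+1] - frames[n]."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def motion_alignment_loss(true_frames, pred_frames):
    """Mean (1 - cosine similarity) between ground-truth and predicted residuals."""
    true_res = frame_residuals(true_frames)
    pred_res = frame_residuals(pred_frames)
    losses = [1.0 - cosine_similarity(t, p) for t, p in zip(true_res, pred_res)]
    return sum(losses) / len(losses)


# Toy data (hypothetical): three "latent frames" of dimension 2.
true_frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]      # moves right
same_motion = [[5.0, 5.0], [7.0, 5.0], [9.0, 5.0]]      # shifted and faster, same direction
opposite_motion = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]  # moves left

loss_same = motion_alignment_loss(true_frames, same_motion)
loss_opposite = motion_alignment_loss(true_frames, opposite_motion)
```

Because this toy loss depends only on residuals, a constant offset shared by all frames cancels out, which gives some intuition for how a residual-based objective can separate motion from appearance.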


Video Motion Customization Results

from Low-resolution & Short Input Video to High-resolution, Long Output Video



Motion of Cars

Input Video of Motion M
Tank + M + On the snow
Police car + M + In a town
Lamborghini + M + In desert
Lamborghini + M + In space
Car + M + Underwater

Input Video of Motion M
Tank + M + On the road
Car + M + In desert

Input Video of Motion M
Tank + M + On the road
Horse + M + On the road
Car + M + On the ice
Bus + M + On the ice

Input Video of Motion M
Elephant + M + In Africa
Bus + M + In a town
Ferrari + M



Motion of Airplanes

Input Video of Motion M
Spaceship + M + In space
Satellite + M
Shark + M + Under the sea

Input Video of Motion M
Spaceship + M
Satellite + M
Shark + M + In the ocean



Motion of Birds (flying)

Input Video of Motion M
Phoenix + M + Above lava
Majestic dragon + M + In forest



Motion of Birds (taking off)

Input Video of Motion M
Owl + M + From a cliff
Eagle + M + On the stone
Seagull + M + On the sand



Motion of Birds (walking)

Input Video of Motion M
Chicken + M + On the road
Chicken + M + In a city
Seagull + M + Underwater

Input Video of Motion M
Eagle + M + On edge
Duck + M + Around a pond
Owl + M + In the forest



Motion of Birds (floating)

Input Video of Motion M
Puppy + M + On the water
Turtle + M + On water
Boat + M + On water
Balloon + M + On water



Motion of Human

Input Video of Motion M
Spider-man + M
Astronaut + M + Underwater in deep sea

Input Video of Motion M
M + In the desert
M + On the ice
M + Motorbike
Monkey + M



Motion of Plants

Input Video of Motion M
Rose + M
Starfish + M
Starfish + M + Chinese watercolor style



Motion of Diffusion

Input Video of Motion M
Ink + M + In water
Jellyfish + M
Flower + M



Motion of Fall

Input Video of Motion M
Gem stones + M
Colorful confetti + M
Bubbles + M
Feathers + M
Stars + M
Snow + M
Stones + M + Underwater



Motion of Space

Input Video of Motion M
M + In deep water
Stars + M + In sky



Motion of Huge animals

Input Video of Motion M
Tiger + M
Bulldog + M + Around flowers

Input Video of Motion M
Origami dinosaur + M
Pigeon + M + In the style of oil art

Input Video of Motion M
Bear + M + In a bamboo grove
Bear + M + On the pond
Lion + M + On the pond

Input Video of Motion M
Panda + M
Panda + M + On the snow
Tiger + M + On the snow



Motion of Small animals

Input Video of Motion M
Dog + M + On the grass
White fox + M + On the beach
Wolf + M + In the flowers

Input Video of Motion M
Puppy + M
Tiger + M
Tiger + M + On the grass
Wolf + M + On the snow

Input Video of Motion M
Dog + M + Watermelon
Fox + M + Watermelon
Rabbit + M + Watermelon + On the grass
Rabbit + M + Watermelon + On the sand
Squirrel + M + Watermelon
Squirrel + M + Orange
Squirrel + M + Watermelon + On the grass
Squirrel + M + Watermelon + On the sand (Train M for 300 steps)
Squirrel + M + Watermelon + On the sand (Train M for 400 steps)


Comparisons to Baselines

VMC is compared against four state-of-the-art baselines: VideoComposer, Gen-1, Tune A Video, and Control A Video.



A car is moving.   ⇨   A car is moving, underwater.

Input Video
VideoComposer
Gen-1
Ours
Tune A Video
Control A Video


Two sharks are moving.   ⇨   Two airplanes are moving in the sky.

Input Video
VideoComposer
Gen-1
Ours
Tune A Video
Control A Video


An owl is taking off.   ⇨   A seagull is taking off on the sand.

Input Video
VideoComposer
Gen-1
Ours
Tune A Video
Control A Video


Ink is spreading.   ⇨   Flower is spreading.

Input Video
VideoComposer
Gen-1
Ours
Tune A Video
Control A Video


Video Style Transfer

Input Video   ⇨   Starry Night by Vincent Van Gogh.
Input Video   ⇨   Oil painting of flowers.
Input Video   ⇨   Classic anime from 1990.
Input Video   ⇨   Starry Night by Vincent Van Gogh.


Backward Motion Customization

We customize the video diffusion model to learn extremely improbable motions:
backward motions obtained from reversed real-world videos.
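
One way to see what reversing a video does to the motion reference: if motion is taken to be the residual between consecutive frames, as described in the overview, then playing the frames backward yields the same residuals negated and in reverse order. The sketch below checks this on toy data. The helper `frame_residuals` and the list-of-floats frame representation are illustrative assumptions rather than the authors' code.

```python
def frame_residuals(frames):
    """Return residuals between consecutive frames: frames[n+1] - frames[n]."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


# Toy data (hypothetical): four "frames" of dimension 2.
frames = [[0.0, 1.0], [1.0, 3.0], [3.0, 4.0], [6.0, 4.0]]

forward = frame_residuals(frames)          # motion M
backward = frame_residuals(frames[::-1])   # motion of the reversed video

# Reversed playback: negate each forward residual and reverse their order.
expected_backward = [[-x for x in r] for r in reversed(forward)]
residuals_match = backward == expected_backward
```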


Real-world Video with Motion M
Reversed Video with Motion M⁻¹
Ink + M⁻¹ + In water
Jellyfish + M⁻¹

Real-world Video with Motion M
Reversed Video with Motion M⁻¹
Car + M⁻¹ + In desert
Tank + M⁻¹ + On the road

Real-world Video with Motion M
Reversed Video with Motion M⁻¹
Lamborghini + M⁻¹ + In space
Car + M⁻¹ + Underwater

Real-world Video with Motion M
Reversed Video with Motion M⁻¹
Eagle + M⁻¹ + On edge
Duck + M⁻¹ + Around a pond

BibTeX

        @article{jeong2023vmc,
                title={VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models},
                author={Jeong, Hyeonho and Park, Geon Yeong and Ye, Jong Chul},
                journal={arXiv preprint arXiv:2312.00845},
                year={2023},
        }