Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer+
Adam Polyak+
Thomas Hayes+
Xi Yin+
Jie An
Songyang Zhang
Qiyuan Hu
Harry Yang
Oron Ashual
Oran Gafni
Devi Parikh+
Sonal Gupta+
Yaniv Taigman+

+Core Contributors

Abstract
We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.


Text-to-Video


There is a table by a window with sunlight streaming through illuminating a pile of books. A blue unicorn flying over a mystical land. A dog wearing a Superhero outfit with red cape flying through the sky. Clown fish swimming through the coral reef. Robot dancing in times square. A litter of puppies running through the yard.
Image Image Image Image Image Image
A knight riding on a horse through the countryside. A panda playing on a swing set. A fluffy baby sloth with a knitted hat trying to figure out a laptop, close up, highly detailed, studio lighting, screen reflecting in its eyes. A teddy bear painting a portrait. A young couple walking in a heavy rain. A bear driving a car.
Image Image Image Image Image Image
A beautiful scenery which leads to a jumpscare. Sailboat sailing on a sunny day in a mountain lake. A musk ox grazing on beautiful wildflowers. Two kangaroos are busy cooking dinner in a kitchen. A small domesticated carnivorous mammal with soft fur, a short snout, and retractable claws. A spaceship being pulled into a blackhole.
Image Image Image Image Image Image
An artists brush painting on a canvas close up. An emoji of a baby panda wearing a red hat, blue gloves, green shirt, and blue pants. Cat watching tv with a remote in hand. Horse drinking water. Humans building a highway on mars. Hyper-realistic photo of an abandoned industrial site during a storm.
Image Image Image Image Image Image
A ufo hovering over aliens in a field. A golden retriever eating ice cream on a beautiful tropical beach at sunset. Monkey learning to play the piano. Photo of a cat singing in a barbershop quartet. A ballerina performs a beautiful and difficult dance on the roof of a very tall skyscraper, the city is lit up and glowing behind her. Unicorns running along a beach.
Image Image Image Image Image Image

Video Variations



Original Video Generated Video Variations
Image Image Image Image Image
Image Image Image Image Image
Image Image Image Image Image

Long Video Generation



a teddy bear on a skateboard in Times Square. A ballerina performs a beautiful and difficult dance on the roof of a very tall skyscraper, the city is lit up and glowing behind her. A panda playing on a swing set. A fluffy baby sloth with a knitted hat trying to figure out a laptop, close up, highly detailed, studio lighting, screen reflecting in its eyes. A teddy bear painting a portrait.
Image Image Image Image Image

Image Animation



Original Image Image Image Image Image
Animated Video Image Image Image Image

Image Interpolation



First Image Image Image Image Image
Second Image Image Image Image Image
Interpolated Video Image Image Image Image

T2V Generation Comparison


4K Illuminated Christmas Tree at Night During Snowstorm Construction Site Activity Digital Information Dramatic ocean sunset fire
Video Diffusion Models Image Image Image Image Image
CogVideo Image Image Image Image Image
Make-A-Video (Ours) Image Image Image Image Image
Irrigation Canal in Western USA Water Sourced from the Colorado River 4K Aerial Video Mountain river Shinjuku Time Lapse snowfall in city Star Time Lapse in Japan
Video Diffusion Models Image Image Image Image Image
CogVideo Image Image Image Image Image
Make-A-Video (Ours) Image Image Image Image Image

Image Interpolation Comparison



FILM Image Image Image Image
Make-A-Video (Ours) Image Image Image Image