Seer: Language Instructed Video Prediction with Latent Diffusion Models

ICLR 2024

Xianfan Gu¹, Chuan Wen¹,²,³, Weirui Ye¹,²,³, Jiaming Song⁴, Yang Gao¹,²,³

¹Shanghai Qi Zhi Institute  ²IIIS, Tsinghua University  ³Shanghai Artificial Intelligence Laboratory  ⁴NVIDIA
 

Abstract

Imagining future trajectories is key for robots to plan soundly and reach their goals. Text-conditioned video prediction (TVP), i.e., predicting future video frames given a language instruction and reference frames, is therefore an essential task for facilitating general robot policy learning. The task is highly challenging: it must ground task-level goals specified by instructions in high-fidelity frames, which requires large-scale data and computation. To tackle this task and empower robots to foresee the future, we propose a sample- and computation-efficient model, named Seer, which inflates a pretrained text-to-image (T2I) Stable Diffusion model along the temporal axis. We inflate the denoising U-Net and the language conditioning model with two novel techniques, Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer, to propagate the rich prior knowledge of the pretrained T2I model across frames. With this well-designed architecture, Seer generates high-fidelity, coherent, and instruction-aligned video frames by fine-tuning only a few layers on a small amount of data. Experiments on the Something-Something V2 (SSv2) and BridgeData datasets demonstrate superior video prediction performance with around 210 hours of training on 4 RTX 3090 GPUs: Seer decreases the FVD of the current SOTA model from 290 to 200 on SSv2 and wins at least 70% preference in human evaluation.
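As a concrete illustration of the inflation idea, the sketch below implements causal attention along the frame axis in the spirit of the Autoregressive Spatial-Temporal Attention named above. It is a minimal sketch under our own assumptions (the module name, tensor layout, and single attention head are all illustrative, not the paper's implementation): every spatial location attends only to itself in the current and earlier frames, so the temporal mixing stays autoregressive.

import torch
import torch.nn as nn

class CausalTemporalAttention(nn.Module):
    # Illustrative module (hypothetical name): each frame attends only to
    # itself and earlier frames at the same spatial position.
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, frames, tokens, channels), tokens = H*W spatial positions
        b, f, n, c = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * n, f, c)   # fold space into batch
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / c ** 0.5      # (b*n, f, f) frame-to-frame scores
        causal = torch.triu(torch.ones(f, f, dtype=torch.bool, device=x.device), diagonal=1)
        attn = attn.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = self.proj(attn @ v)                        # mix information from past frames only
        return out.reshape(b, n, f, c).permute(0, 2, 1, 3)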



Method

Seer's pipeline consists of an Inflated 3D U-Net for diffusion and a Frame Sequential Text Transformer (FSeq Text Transformer) for text conditioning. During training, all video frames are compressed into latent space with a pre-trained VAE encoder. Conditional latent vectors, encoded from the reference video frames, are concatenated with noisy latent vectors along the frame axis to form the input latent sequence. During inference, the conditional latent vectors are concatenated with Gaussian noise vectors; text conditioning is injected for each frame by the FSeq Text Transformer (e.g., the global instruction embedding for "Moving remote and small remote away from each other." is decomposed into 12 frame-level sub-instructions along the frame axis); and the denoised outputs are decoded back to RGB video frames with the pre-trained VAE decoder.
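To make the data flow concrete, here is a minimal sketch (our reading of the paragraph above, not the released code) of how the U-Net input could be assembled at inference time. vae_encode and decompose_text are hypothetical callables standing in for the pre-trained VAE encoder and the FSeq Text Transformer.

import torch

def build_unet_inputs(ref_frames, vae_encode, decompose_text, instruction,
                      num_pred_frames=10):
    # ref_frames: (B, n_ref, 3, H, W) reference video frames (hypothetical shape)
    z_ref = vae_encode(ref_frames)                        # (B, n_ref, C, h, w) conditional latents
    noise = torch.randn(z_ref.shape[0], num_pred_frames,  # Gaussian noise for frames to predict
                        *z_ref.shape[2:], device=z_ref.device)
    latents = torch.cat([z_ref, noise], dim=1)            # concatenate along the frame axis
    # Global instruction -> one sub-instruction embedding per frame,
    # e.g. 2 reference + 10 predicted frames = 12 frame-level conditions.
    text_cond = decompose_text(instruction, num_frames=latents.shape[1])
    return latents, text_cond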


Results




Text-conditioned Video Prediction/Manipulation (Something-Something V2 Dataset)

[Side-by-side video pairs; paired instructions listed below.]
"Moving pen up." / "Moving pen down."
"Turning the camera left while filming wall mounted fan." / "Turning the camera right while filming wall mounted fan."
"Pretending to pick a box of juice up." / "Picking a box of juice up."
"Pushing scissors so that it falls off the table." / "Picking a scissors up."

[Each example shows input frames, the text instruction, the real video, and the synthesized video.]
"Pushing iphone adapter from left to right."
"Covering salt shaker with a towel."
"Dropping a card in front of a coin."
"Folding mat."
"Moving a bottle and a glass away from each other."
"Tearing a piece of paper into two pieces."

Text-conditioned Video Prediction/Manipulation (BridgeData)

[Side-by-side video pairs; paired instructions listed below.]
"Put banana on plate." / "Put corn on plate."
"Pick up green mug." / "Pick up glass cup."
"Turn lever vertical to front." / "Pick up knife from sink."
"Pick up bowl and put in small box." / "Close small box flap."

[Each example shows input frames, the text instruction, the real video, and the synthesized video.]
"Flip pot upright in sink."
"Put pot on stove which is near stove."

Text-conditioned Video Prediction (Epic-Kitchens)

[Each example shows input frames, the text instruction, the real video, and the synthesized video.]
"Open cupboard"
"Wiping bowl with rag"
"Chopping onion"

BibTeX

@article{gu2023seer,
    author  = {Gu, Xianfan and Wen, Chuan and Ye, Weirui and Song, Jiaming and Gao, Yang},
    title   = {Seer: Language Instructed Video Prediction with Latent Diffusion Models},
    journal = {arXiv preprint arXiv:2303.14897},
    year    = {2023},
}
Website template credit to Plug-and-Play Diffusion.