Pandora: Towards General World Model
with Natural Language Actions and Video States

We introduce Pandora, a step towards a General World Model (GWM) that:

  • Simulates world states by generating videos across any domain
  • Allows any-time control with actions expressed in natural language

On-the-Fly Control with Natural Language

Pandora accepts free-text actions as input during video generation to steer the video on the fly. This differs crucially from previous text-to-video models, which accept a text prompt only at the beginning of the video. On-the-fly control fulfills the promise of a world model: it supports interactive content generation and enables more robust reasoning and planning.
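To make the interaction pattern concrete, here is a minimal sketch of an on-the-fly control loop. The PandoraWorldModel class and its generate_chunk method are hypothetical placeholders, not the project's actual API; they stand in for a model that extends the video generated so far by a short clip conditioned on a newly supplied natural-language action.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PandoraWorldModel:
    # Hypothetical stand-in for the real model; the actual Pandora API may differ.
    frames: List[str] = field(default_factory=list)  # frames generated so far

    def generate_chunk(self, action: str, num_frames: int = 8) -> List[str]:
        # Extend the video by num_frames frames, conditioned on the frames
        # generated so far and the latest natural-language action.
        new = [f"frame({len(self.frames) + i}, action={action!r})" for i in range(num_frames)]
        self.frames.extend(new)
        return new

# On-the-fly control: unlike one-shot text-to-video, a new action can be
# injected at every step while generation is already in progress.
model = PandoraWorldModel()
for step, action in enumerate([
    "A car drives down a coastal road",  # initial prompt
    "Turn left at the intersection",     # action injected mid-generation
    "It starts to rain",                 # action injected mid-generation
]):
    chunk = model.generate_chunk(action)
    print(f"step {step}: {len(chunk)} new frames under action {action!r}")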


Predicting Alternative Futures as You Like

World models simulate alternative futures of the world, and Pandora lets you choose which future to explore. Below are counterfactual futures: different videos generated from the same initial state under different actions. A minimal code sketch of this branching follows the examples.

Each row below starts from the same initial state; the two actions yield two different futures.

  • Future 1: Magma erupts from the crater / Future 2: The sky gets dark
  • Future 1: Turn left. There is a white van / Future 2: Turn right. There is a red car
  • Future 1: The man waves his hand / Future 2: Two men walk towards the microphone
  • Future 1: Use spoon to scoop some broccoli / Future 2: Use spoon to stir the rice
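A minimal sketch of the branching pattern above, assuming a hypothetical generate(initial_state, action) helper rather than the project's real API: the same initial state is paired with two different natural-language actions, and each pairing yields a different future video.

def generate(initial_state: str, action: str) -> str:
    # Hypothetical helper (not the actual Pandora API): produce a future video
    # from an initial state and a natural-language action. Here it only returns
    # a descriptive placeholder string.
    return f"video(start={initial_state!r}, action={action!r})"

initial_state = "volcano_crater.png"  # hypothetical initial-state image

# Two counterfactual futures branch from the same initial state.
future_1 = generate(initial_state, "Magma erupts from the crater")
future_2 = generate(initial_state, "The sky gets dark")

print(future_1)
print(future_2)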

Simulating Worlds across Any Domain

Pandora generates videos across a wide range of domains, including indoor/outdoor, natural/urban, human/robot, and 2D/3D scenarios. You can find more videos in the Pandora’s Box gallery.

Example actions:

  • Pouring milk into the glass cup from a milk bottle
  • Flame ignites woods emitting some smoke
  • Wind blows the leaves
  • Let the traffic go
  • The man drops the bag
  • Fold the towel
  • Jump right
  • Look around
  • Set fire on the river
  • Fireworks bloom in the night sky

Learning Actions in One Domain and Using Them in Another

Instruction tuning with high-quality data allows the model to learn effective action control and transfer it to unseen domains. For example, Pandora saw only the 2D game Coinrun during training, yet it seamlessly applies the learned actions to other 2D games.


Autoregressive Model Yields Longer Videos

Existing diffusion-based video models typically produce videos of a fixed length. By integrating the video model with Pandora's autoregressive backbone, we can generate longer videos of potentially unlimited duration. Below are 8-second videos generated by Pandora, even though our training videos are at most 5 seconds long; a minimal sketch of the chunked generation loop follows the examples.

Example actions:

  • The car moves forward
  • The man is flying
  • Walk into the theatre
  • The car moves forward
  • Let the train move
  • The plane is flying
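As a minimal sketch of the chunked, autoregressive generation referenced above, assume a hypothetical generate_clip(context_frames, action) call (not the project's real API): each new clip is conditioned on the frames produced so far, so clips can be chained past the length of any single training video.

from typing import List

FPS = 8            # hypothetical frame rate for this sketch
CLIP_SECONDS = 2   # hypothetical length of each generated clip

def generate_clip(context_frames: List[str], action: str) -> List[str]:
    # Hypothetical stand-in (not the actual Pandora API): generate the next
    # clip conditioned on the frames produced so far and the current action.
    start = len(context_frames)
    return [f"frame({start + i})" for i in range(FPS * CLIP_SECONDS)]

# Chain clips autoregressively: even if training clips are at most 5 seconds,
# chaining four 2-second clips yields an 8-second video.
frames: List[str] = []
for _ in range(4):
    frames += generate_clip(frames, "The car moves forward")

print(f"total duration: {len(frames) / FPS:.0f} s")  # prints 8 s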

Limitations

As a preliminary step towards a GWM, Pandora is still limited. It can fail at generating consistent videos, simulating complex scenarios, understanding commonsense and physical laws, and following instructions/actions.

Failure case actions:

  • Pick up the wallet
  • Take out the nozzle
  • The man is dancing
  • The train door is open

Note

We processed the videos on this website with FLAVR for frame interpolation to make them smoother. No other post-processing was applied.

If you have concerns about the copyright of any image/video on this website, please contact us.

Pandora Team

Jiannan Xiang, Guangyi Liu, Yi Gu (equal contribution)

Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, Zhengzhong Liu

Eric P. Xing, Zhiting Hu

Maitrix.org

UCSD

MBZUAI

@article{xiang2024pandora,
  title={Pandora: Towards General World Model with Natural Language Actions and Video States},
  author={Jiannan Xiang and Guangyi Liu and Yi Gu and Qiyue Gao and Yuting Ning and Yuheng Zha and Zeyu Feng and Tianhua Tao and Shibo Hao and Yemin Shi and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
  journal={arXiv preprint arXiv:2406.09455},
  year={2024}
}