This repo provides training and inference code for the paper "Large Video Planner Enables Generalizable Robot Control"
Download all the metadata files for the eight filtered datasets, as well as our third-party collected test set:
```shell
huggingface-cli download KempnerInstituteAI/LVP \
  --include "data/**" \
  --local-dir . \
  --local-dir-use-symlinks False
```

This will download each data folder under `data/`.
Put all downloaded checkpoints under `data/ckpts`:
```shell
huggingface-cli download KempnerInstituteAI/LVP \
  --include "checkpoints/**" \
  --local-dir . \
  --local-dir-use-symlinks False
mv checkpoints data/ckpts
```

This takes 66 GB of disk space, so make sure you have room.
After downloading, the trained checkpoint should be at `data/ckpts/lvp_14B.ckpt`.
This path is specified in configurations/algorithm/wan_i2v.yaml
This codebase uses the Wan 2.1 Image-to-Video (I2V) 14B model for video generation. The checkpoint includes:
- Wan2.1 diffusion model weights (14B parameters), which we fine-tuned
- VAE encoder/decoder
- T5 text encoder (UMT5-XXL)
- CLIP image encoder (XLM-Roberta-Large-ViT-Huge)
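As a quick sanity check after downloading, you can verify that the expected components are in place. This is a minimal sketch; the entry names below are assumptions based on the component list above and the paths mentioned in this README, so adjust them to match the actual checkpoint layout:

```python
# Sanity-check that expected checkpoint entries exist under data/ckpts.
# NOTE: these names are assumptions based on the README; verify against
# the actual downloaded layout and configurations/algorithm/wan_i2v.yaml.
from pathlib import Path

EXPECTED_COMPONENTS = [
    "lvp_14B.ckpt",         # fine-tuned LVP diffusion weights
    "Wan2.1-I2V-14B-480P",  # base Wan 2.1 folder (VAE, T5, CLIP, main model)
]

def missing_components(ckpt_dir: str, expected=EXPECTED_COMPONENTS) -> list[str]:
    """Return the expected entries that are not present under ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in expected if not (root / name).exists()]
```

After both downloads, `missing_components("data/ckpts")` should return an empty list.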
Official Download Instructions: Please refer to the Wan 2.1 GitHub repository for the most up-to-date checkpoint download instructions.
Quick Download (using Hugging Face CLI):
```shell
# Download Wan 2.1 I2V 14B 480P (recommended for this codebase)
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./data/ckpts/Wan2.1-I2V-14B-480P
```

The checkpoint will be downloaded to `./data/ckpts/Wan2.1-I2V-14B-480P/` and automatically includes all necessary components (VAE, T5, CLIP, main model).
Note: The 480P model is used in our training pipeline. The checkpoint path is configured in configurations/algorithm/wan_i2v.yaml.
This document provides detailed instructions for running inference and training with the EI World Model codebase.
- Python 3.10
- CUDA 12.1+ (for GPU support)
- Conda or Mamba package manager
```shell
# using conda
conda create python=3.10 -n ei_world_model
conda activate ei_world_model
```

We store Python dependencies in `requirements.txt`:
```shell
# Install core dependencies
pip install -r requirements.txt

# Install Flash Attention (for efficient attention)
# This may take several minutes to compile
pip install flash-attn --no-build-isolation
```

Note: If you encounter issues with flash-attn, you can skip it for inference-only usage. It is primarily needed for efficient training.
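Since flash-attn is optional here, a common pattern is to detect it at runtime and fall back to standard attention. This is a generic sketch, not the repo's actual detection logic:

```python
# Detect optional flash-attn without importing it (avoids a hard failure
# when the package is absent). Generic pattern, not this repo's code.
import importlib.util

def flash_attn_available() -> bool:
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None
```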
If you want to convert generated videos to robot actions, install the video2robot pipeline dependencies:
```shell
cd video2robot
git submodule update --init --recursive
```

Install the external requirements (Hamer, Dex-retargeting, MegaSaM) following the upstream docs.
WandB is used for experiment tracking and logging.
```shell
# Login to WandB
wandb login

# Or set your API key
export WANDB_API_KEY=your_api_key_here
```

Update your WandB entity in `configurations/config.yaml`:

```yaml
wandb:
  entity: your-wandb-username # Change this to your WandB username or org
  project: ei_world_model
  mode: online # Use 'offline' for no internet, 'dryrun' for testing
```

Note: we set wandb to offline by default, so you can go through the rest of the code without setting up WandB first.
Test your installation with a quick inference run:
```shell
# Test with toy model (no checkpoints needed)
python -m main \
  +name=test_installation \
  experiment=exp_video \
  algorithm=wan_toy \
  dataset=dummy \
  experiment.tasks=[validation] \
  experiment.validation.limit_batch=1
```

If this runs without errors, your environment is set up correctly!
For distributed training on SLURM clusters, you may need to set:
```shell
# For offline compute nodes with WandB sync
export WANDB_MODE=offline
export WANDB_DIR=/path/to/wandb/logs

# For debugging
export HYDRA_FULL_ERROR=1
export CUDA_LAUNCH_BLOCKING=1
```

Issue: CUDA out of memory
- Reduce the batch size in the experiment config: `experiment.training.batch_size=1`
- Enable gradient checkpointing: `algorithm.gradient_checkpointing_rate=1.0`
Inference generates videos given an image and a text prompt using a pretrained model.
```shell
mkdir -p <your-output-folder>

python -m main \
  +name=<your_exp_name> \
  experiment=exp_video \
  algorithm=wan_i2v \
  dataset=ours_test \
  experiment.tasks=[validation] \
  algorithm.logging.video_type=single \
  experiment.num_nodes=1 \
  experiment.validation.limit_batch=null \
  algorithm.hist_guidance=1.5 \
  algorithm.lang_guidance=2.5
```

- `+name=<your_exp_name>`: Unique experiment name for this run. Used for logging and organizing outputs in WandB and the file system.
- `experiment=exp_video`: Specifies the experiment type
  - Points to: `configurations/experiment/exp_video.yaml`
  - Defines: training/validation settings, tasks, precision, batch size
- `algorithm=wan_i2v`: Selects the Wan 2.1 Image-to-Video model
  - Points to: `configurations/algorithm/wan_i2v.yaml`
  - Inherits from: `wan_t2v.yaml`
  - You need to set the checkpoint path in these two YAML files
- `dataset=ours_test`: Specifies the evaluation dataset
  - Points to: `configurations/dataset/ours_test.yaml`
  - Format: CSV with metadata (video_path, caption, height, width, fps, n_frames)
  - The specific CSV format is discussed in `dataset/README.md`
- `experiment.tasks=[validation]`: Runs validation/inference mode
  - Executes the `validation()` method in `experiments/exp_video.py`
- `cluster=fast_high`: SLURM cluster settings we used for evaluation
  - Points to: `configurations/cluster/phase3_eval.yaml`
  - Settings: 4 H100 GPUs, 48 CPUs, 512 GB memory, 1-day time limit
- `experiment.num_nodes=1`: Number of compute nodes (1 for inference)
- `experiment.validation.limit_batch=null`: Process all batches
  - Set to a number (e.g., `10`) to limit evaluation to N batches for quick testing
- `algorithm.hist_guidance=1.5`: Historical guidance scale for conditioning on previous frames
  - Controls how strongly the model follows the input image
  - Range: 0.0 (no guidance) to 3.0+ (strong guidance)
  - Recommended: 1.5
- `algorithm.lang_guidance=2.5`: Language guidance scale (classifier-free guidance)
  - Controls how strongly the model follows the text prompt
  - Range: 0.0 (no guidance) to 5.0+ (strong guidance)
  - Recommended: 2.0 or 2.5
- `algorithm.logging.video_type=single`: Save videos individually
  - Alternative: `grid` saves all videos in a grid layout
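The two guidance scales weight how strongly the denoiser's conditional predictions are followed, in the style of classifier-free guidance. The nested combination below is a common formulation for dual conditions (image history and language); it is a sketch of the idea, and may differ from the repo's exact implementation:

```python
def dual_guided_prediction(eps_uncond: float, eps_hist: float, eps_both: float,
                           hist_guidance: float = 1.5,
                           lang_guidance: float = 2.5) -> float:
    """Combine three denoiser outputs (shown here as scalars for clarity):
    eps_uncond: no image, no text
    eps_hist:   image conditioning only
    eps_both:   image + text conditioning
    Larger scales push the output further toward the conditioned predictions.
    """
    return (eps_uncond
            + hist_guidance * (eps_hist - eps_uncond)
            + lang_guidance * (eps_both - eps_hist))
```

With both scales at 0 the model ignores conditioning entirely; with both at 1 it reduces to the fully conditioned prediction.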
Training fine-tunes the Wan 2.1 models on custom video datasets.
```shell
python -m main \
  +name=final_i2v \
  experiment=exp_video \
  algorithm=wan_i2v \
  dataset=mixture \
  experiment.num_nodes=32 \
  algorithm.lang_guidance=0 \
  algorithm.hist_guidance=0 \
  experiment.validation.val_every_n_step=100000000
```

For rapid iteration and debugging, use a smaller toy model:

```shell
python -m main \
  +name=print_dataset_mix_debug_train \
  experiment=exp_video \
  algorithm=wan_toy \
  dataset=mixture \
  experiment.num_nodes=1 \
  algorithm.lang_guidance=0 \
  algorithm.hist_guidance=0 \
  experiment.validation.val_every_n_step=100000000
```

- `+name=final_i2v`: Experiment name for WandB logging and checkpoints
- `experiment=<your exp-name>`: Same as inference
  - Default task: `[training]` (defined in `exp_video.yaml`)
- `algorithm=wan_i2v` or `algorithm=wan_toy`:
  - `wan_i2v`: Full 14B parameter model (`wan_i2v.yaml`)
  - `wan_toy`: Tiny model for debugging (`wan_toy.yaml`)
    - Only 2 layers, 128 dimensions (vs. 40 layers, 5120 dimensions)
    - No checkpoint loading required
- `dataset=mixture`: Combined dataset of multiple sources
  - Points to: `configurations/dataset/mixture.yaml`
  - Includes: Pandas, Epic Kitchen, Ego4D, DROID, Something-Something, Bridge, AgibotWorld, Language Table
  - Weighted mixture based on dataset sizes and importance
- `cluster=phase3`: Training cluster settings we used
  - Points to: `configurations/cluster/phase3.yaml`
  - Settings: 4 H100 GPUs per node, 32 nodes, priority queue, 14-day time limit
- `experiment.num_nodes=32`: Multi-node distributed training
  - 32 nodes × 4 GPUs = 128 GPUs for full training
  - Set to `1` for debugging with the toy model
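The weighted mixture amounts to sampling a source dataset with probability proportional to its weight before drawing a clip from it. A minimal sketch; the dataset names and weights below are illustrative only, and the real mixture lives in `configurations/dataset/mixture.yaml`:

```python
import random

def sample_sources(weights: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n dataset names with probability proportional to their weights."""
    rng = random.Random(seed)
    names, w = zip(*weights.items())
    return rng.choices(names, weights=w, k=n)

# Illustrative weights only; see mixture.yaml for the actual configuration.
example = sample_sources({"bridge": 1.0, "droid": 2.0, "ego4d": 3.0}, n=4)
```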
The codebase uses Hydra for hierarchical configuration management:
- Base Config: `configurations/config.yaml`
  - Specifies defaults: experiment, dataset, algorithm, cluster
  - WandB settings for logging
- Config Composition: Hydra composes configs from multiple YAML files
  - Command-line overrides: `algorithm.lang_guidance=2.5`
  - Inheritance: `wan_i2v.yaml` inherits from `wan_t2v.yaml`
- Config Resolution: `main.py` resolves all configs and passes them to the experiment
  - Entry: `python -m main +name=... experiment=... algorithm=... dataset=...`
  - Hydra Setup: `main.py` loads and merges all configs
  - Experiment Creation: `experiments/exp_video.py` builds the experiment
  - Task Execution: calls `experiment.exec_task(task)` for each task in `experiment.tasks`
    - Training: sets up dataloaders, trainer, and runs the training loop
    - Validation: loads the model, generates videos, saves outputs
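The composition behavior above (inheritance between YAML files, then command-line overrides winning last) boils down to recursive dict merging. A minimal sketch with plain dicts and illustrative values; Hydra/OmegaConf do this for real:

```python
def merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; later values win, Hydra-style."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)  # descend into nested sections
        else:
            out[key] = value                   # override wins
    return out

# wan_i2v.yaml "inherits" from wan_t2v.yaml, then CLI overrides apply last.
# The keys/values here are illustrative, not the actual config contents.
wan_t2v = {"lang_guidance": 0.0, "layers": 40}
wan_i2v = merge(wan_t2v, {"hist_guidance": 0.0})
final = merge(wan_i2v, {"lang_guidance": 2.5})  # command-line override
```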
The video2robot pipeline converts generated hand-motion videos into executable robot commands for dexterous robot hands. This enables generated videos from LVP to control real robots.
The pipeline consists of the following stages:
- Camera Estimation (MegaSaM): Estimates per-frame camera intrinsics and poses from the video
- Hand Pose Extraction (HAMER): Reconstructs 3D hand meshes and MANO parameters per frame
- Camera Alignment: Aligns HAMER wrist poses into the MegaSaM world coordinate frame
- Retargeting: Maps human hand joint angles to robot hand joint angles
- Real-Robot Conversion: Converts retargeted poses to G1 commands
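The stages above form a simple chain, each consuming the artifacts of the previous one. The sketch below is a hypothetical illustration of that data flow; the stage functions and artifact names are placeholders, not the actual video2robot API (the real entry point is `run_video_pipeline.sh`):

```python
# Hypothetical sketch of the video2robot stage chain; stage names and
# artifact keys are placeholders, not the real pipeline's API.
def run_pipeline(video_frames, stages):
    """Thread an artifact dict through the pipeline stages in order."""
    artifacts = {"frames": video_frames}
    for stage in stages:
        artifacts.update(stage(artifacts))
    return artifacts

stages = [
    lambda a: {"cameras": f"megasam({len(a['frames'])} frames)"},  # camera estimation
    lambda a: {"hands": "hamer_mano_params"},                      # hand pose extraction
    lambda a: {"aligned": "wrists_in_world_frame"},                # camera alignment
    lambda a: {"joints": "robot_joint_angles"},                    # retargeting
    lambda a: {"g1_cmds": "inspire_dof_commands"},                 # G1 conversion
]
```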
```shell
./run_video_pipeline.sh <video_name> <output_dir>
```

This command wraps the steps above and consolidates artifacts under `<output_dir>/<video_name>/`:

- `hamer.json` – raw HAMER detections with per-frame wrist poses.
- `align.json` – HAMER sequence expressed in the MegaSaM cam0/world frame.
- `retarget_vector.json` – Inspire finger joint angles (vector retargeting results).
- `cmd.json` – converted Inspire DOF commands ready for playback.
- `g1.json` – wrist pose history transformed to the Inspire G1 coordinate system.
- `<video_name>_droid.npz` – MegaSaM camera intrinsics/extrinsics and depth caches.
- `frames/` – RGB frames extracted from the input video.
- `summary.json` – a helpful index linking the video, calibration, and retarget outputs.
For executing generated hand commands on a real robot (e.g., Unitree G1 with Inspire hands), you can integrate the hand controller with the arm controller for unified control:
```python
# Initialize hand command publisher and hand state subscriber
self.HandCmb_publisher = ChannelPublisher(kTopicInspireCommand, MotorCmds_)
self.HandCmb_publisher.Init()
self.HandState_subscriber = ChannelSubscriber(kTopicInspireState, MotorStates_)
self.HandState_subscriber.Init()

# Initialize hand message with motor commands
self.hand_msg = MotorCmds_()
self.hand_msg.cmds = [unitree_go_msg_dds__MotorCmd_() for _ in range(
    len(Inspire_Right_Hand_JointIndex) + len(Inspire_Left_Hand_JointIndex)
)]
```

When you need to send hand commands to the robot:

```python
arm.HandCmb_publisher.Write(arm.hand_msg)
```

For a complete reference implementation, see the Unitree XR Teleoperate repository.
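Before calling `Write`, each motor command in `hand_msg.cmds` needs a target for its joint. A generic sketch of packing retargeted angles into a flat per-motor list with clamping; the joint-limit values are placeholders, and the target-field name on the SDK message should be checked against the Unitree message definition:

```python
# Generic sketch: clamp retargeted joint angles to per-joint limits before
# filling the motor command list. The (lo, hi) limits here are placeholders.
def pack_hand_command(angles: list[float],
                      limits: list[tuple[float, float]]) -> list[float]:
    """Clamp each joint angle into its [lo, hi] range, one entry per motor."""
    assert len(angles) == len(limits), "one limit pair per joint"
    return [min(max(a, lo), hi) for a, (lo, hi) in zip(angles, limits)]
```

The clamped values would then be assigned to the per-motor targets of `arm.hand_msg.cmds` before `arm.HandCmb_publisher.Write(arm.hand_msg)`.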