Simulation Distillation
Pretraining World Models in Simulation for Rapid Real-World Adaptation
3D Planning Demo
SimDist plans in a latent world model pretrained in simulation. Below, we reconstruct and visualize the latent plans. Play with them yourself!
The Simulation Distillation Pipeline
SimDist distills structural priors from large-scale mixed-quality simulation data into a latent world model, then rapidly improves real-world performance by planning with the model while finetuning its dynamics predictions.
Reliable improvement with simple, supervised system identification!
Train a state-based expert policy.
Perturb expert actions to generate a large-scale, diverse dataset.
Distill the simulation data into a world model that operates on raw perception, and deploy it with online planning.
Finetune dynamics predictions with real-world data to improve planning performance.
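As a rough illustration of the data-generation step, the sketch below rolls out an expert with Gaussian action noise at several scales to produce mixed-quality trajectories. Everything here is an assumption made for illustration: the toy expert, the point-mass simulator, the noise scales, and the names `collect_episode`, `toy_expert`, and `toy_step` are hypothetical and not taken from the SimDist codebase, and the actual perturbation scheme may differ.

```python
import numpy as np

def collect_episode(policy, step_fn, x0, horizon, noise_scale, rng):
    """Roll out `policy` with Gaussian action noise; return visited states and executed actions."""
    x = np.asarray(x0, dtype=float)
    states, actions = [x], []
    for _ in range(horizon):
        a = policy(x)
        # Perturb the expert action, then clip to the valid action range.
        a = np.clip(a + rng.normal(0.0, noise_scale, size=a.shape), -1.0, 1.0)
        x = step_fn(x, a)
        states.append(x)
        actions.append(a)
    return np.stack(states), np.stack(actions)

# Toy stand-ins (assumptions): a proportional "expert" and a point-mass simulator.
def toy_expert(x):
    return np.clip(-x, -1.0, 1.0)

def toy_step(x, a):
    return x + 0.1 * a

rng = np.random.default_rng(0)
noise_scales = [0.0, 0.1, 0.3, 1.0]  # hypothetical: from pure expert to heavily perturbed
episodes_per_scale = 5
dataset = [
    (scale, *collect_episode(toy_expert, toy_step, [1.0, -0.5], 30, scale, rng))
    for scale in noise_scales
    for _ in range(episodes_per_scale)
]
```

Because the noise is injected while stepping the simulator, the perturbed episodes visit states the expert alone would not reach, which is what gives the world model coverage beyond expert behavior.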
Rapid Performance Improvement
Simulation Distillation (SimDist) rapidly overcomes the sim-to-real dynamics gap through adaptation in the real world, resulting in substantial gains in task execution on both precise manipulation and quadrupedal locomotion tasks.
Higher Task Throughput
SimDist increases task throughput by improving both reliability and execution speed after real-world adaptation, enabling more successful task completions in less time.
Added Robustness
SimDist improves robustness to external disturbances, allowing the adapted planner to recover from unexpected physical perturbations while continuing toward the task goal.
Catastrophic Forgetting
Existing end-to-end reinforcement learning methods often collapse when finetuning policies in new domains, indicating catastrophic forgetting of pretraining priors. These algorithms entangle learning representations, dynamics, and returns, forcing relearning of the entire task structure in the new domain.
World Models
Task Structure
Our key insight: world models automatically decompose task structure in a form that we can exploit to target adaptation where it is needed. We argue that the encoder, reward model, and value function capture the global structure of the problem in a form that is largely invariant between simulation and the real world. We therefore freeze these components during real-world finetuning and finetune only the dynamics model. This sidesteps the need for end-to-end learning with sparse real-world data and avoids long-horizon credit assignment, a central challenge for existing RL approaches.
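To make the decomposition concrete, here is a minimal sketch of how a sampling-based planner could score candidate action sequences in latent space: it rolls the dynamics forward, sums predicted rewards, and bootstraps with the value function at the end of the horizon. The planner shown is plain random shooting, used as a stand-in because the text above does not specify SimDist's planner. The toy dynamics, reward, and value functions and all function names are hypothetical.

```python
import numpy as np

def plan_score(z0, actions, dynamics, reward, value, gamma=0.99):
    """Discounted predicted reward along a latent rollout plus a terminal value estimate."""
    z, total, discount = np.asarray(z0, dtype=float), 0.0, 1.0
    for a in actions:
        total += discount * reward(z, a)
        z = dynamics(z, a)  # the only component that changes during real-world finetuning
        discount *= gamma
    return total + discount * value(z)

def random_shooting(z0, dynamics, reward, value, horizon, num_samples, action_dim, rng):
    """Sample action sequences uniformly and return the first action of the best-scoring one."""
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    scores = np.array([plan_score(z0, c, dynamics, reward, value) for c in candidates])
    best = int(np.argmax(scores))
    return candidates[best, 0], float(scores[best])

# Toy stand-ins (assumptions): the goal is the latent origin.
def toy_dynamics(z, a):
    return z + 0.1 * a

def toy_reward(z, a):
    return -float(z @ z)

def toy_value(z):
    return -10.0 * float(z @ z)

rng = np.random.default_rng(0)
first_action, best_score = random_shooting(
    np.array([1.0, 1.0]), toy_dynamics, toy_reward, toy_value,
    horizon=5, num_samples=256, action_dim=2, rng=rng,
)
```

In this framing, `reward` and `value` are held fixed after pretraining, and swapping in a finetuned `dynamics` changes which latent states the planner expects to reach and therefore which plans it prefers.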
Transferring State Representations
To transfer reliably from simulation to the real world, the encoder must learn a state representation that remains valid in the real-world environment. Below, we display images reconstructed from the latent states predicted by the world model. They show that the encoder, trained entirely in simulation, captures a robust and accurate representation of the real world.
Note: we do not train the world model with a reconstruction loss. These images are produced by an auxiliary probe that was trained to predict real images from encoded latent states.
Freezing and Transferring Task Structure
Transferring Value Functions
To be useful for planning, the value function we transfer from simulation does not need to model real-world returns exactly. Instead, the planner only needs the value function to accurately discriminate between high-quality and low-quality states in the real world. This enables the planner to reason counterfactually at test time to improve performance. Below, we see that the value function accurately discriminates between successful and failed trajectories in the real world.
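One way to quantify this kind of discrimination is a pairwise ranking accuracy: the fraction of (successful, failed) pairs in which the successful sample receives the higher value. This is a generic metric (equivalent to ROC AUC) offered as an illustrative sketch, not necessarily the evaluation used here. The function name and the numbers are made up.

```python
import numpy as np

def pairwise_ranking_accuracy(values_success, values_failure):
    """Fraction of (success, failure) pairs ranked correctly; ties count as half."""
    s = np.asarray(values_success, dtype=float)[:, None]
    f = np.asarray(values_failure, dtype=float)[None, :]
    return float(np.mean((s > f) + 0.5 * (s == f)))

# Made-up value estimates for states from successful and failed rollouts.
v_success = np.array([2.0, 3.0])
v_failure = np.array([1.0, 2.5])
accuracy = pairwise_ranking_accuracy(v_success, v_failure)

# A miscalibrated but order-preserving value function ranks the pairs identically.
accuracy_miscalibrated = pairwise_ranking_accuracy(0.1 * v_success - 5.0, 0.1 * v_failure - 5.0)
```

The metric depends only on the ordering of values, which mirrors the argument above: a value function with the wrong scale or offset in the real world can still steer the planner as long as it ranks states correctly.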
World Models
Adapting Dynamics Prediction
Adapting the dynamics model is essential for effective planning, as both the reward and value estimates are computed over predicted trajectories. By freezing the encoder, we reduce adaptation to a supervised learning problem in a low-dimensional latent space, which can be solved reliably in low-data regimes.
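The sketch below illustrates why a frozen encoder turns adaptation into plain regression: inputs and targets are both fixed latent vectors, so the dynamics can be fit with an ordinary mean-squared-error loss starting from the pretrained weights. It assumes a linear latent dynamics model, a linear encoder, a one-step prediction loss, and synthetic data, all chosen for brevity. The actual world model is presumably a neural network, and every name and constant here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim, action_dim, n = 8, 3, 2, 200

# Frozen encoder stand-in (assumption): a fixed linear projection with orthonormal rows.
E = np.linalg.qr(rng.normal(size=(obs_dim, latent_dim)))[0].T  # shape (latent_dim, obs_dim)

def encode(obs):
    return obs @ E.T

# Pretrained ("sim") dynamics and the mismatched "real" dynamics, z' = A z + B a.
A_sim = 0.9 * np.eye(latent_dim)
B_sim = 0.1 * rng.normal(size=(latent_dim, action_dim))
A_real = A_sim + 0.1 * rng.normal(size=A_sim.shape)
B_real = 0.5 * B_sim  # e.g. actions are less effective than in simulation

# Synthetic real-world transitions whose encodings follow the real latent dynamics.
obs = rng.normal(size=(n, obs_dim))
act = rng.uniform(-1.0, 1.0, size=(n, action_dim))
next_obs = (encode(obs) @ A_real.T + act @ B_real.T) @ E

def dynamics_loss(A, B, z, a, z_next):
    """Mean squared one-step prediction error in latent space."""
    err = z @ A.T + a @ B.T - z_next
    return float(np.mean(np.sum(err ** 2, axis=1)))

def finetune_dynamics(A, B, z, a, z_next, lr=0.05, steps=500):
    """Gradient descent on the latent prediction loss, initialized at the pretrained weights."""
    A, B = A.copy(), B.copy()
    for _ in range(steps):
        err = z @ A.T + a @ B.T - z_next
        A -= lr * (2.0 / len(z)) * err.T @ z
        B -= lr * (2.0 / len(z)) * err.T @ a
    return A, B

# The encoder is never updated, so the regression targets stay fixed throughout.
z, z_next = encode(obs), encode(next_obs)
loss_before = dynamics_loss(A_sim, B_sim, z, act, z_next)
A_ft, B_ft = finetune_dynamics(A_sim, B_sim, z, act, z_next)
loss_after = dynamics_loss(A_ft, B_ft, z, act, z_next)
```

If the encoder were trained jointly, the targets `z_next` would shift as the representation changed. Holding it fixed keeps the problem a standard regression over a few hundred transitions.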
Finetuning drastically lowers dynamics prediction loss on a held-out quadruped slippery-slope trajectory.
During this trajectory, the front-left foot slips.
At this same instant, the finetuned model correctly anticipates the future slippage, while the pretrained model fails to do so.
Real-World Results
Success rate for two manipulation tasks, computed over 20 trials, and average forward progress for two quadruped locomotion tasks, averaged across all 15 trials (3 speeds, 5 trials each), as a function of real-world finetuning data. For manipulation, we consider two difficulties: initial conditions drawn from a Narrow or Wide grid.