In our paper, we borrow the data source from TECO for world model adaptation experiments. If you only need a few samples, you can also try the mini sets from Diffusion Forcing, as downloading the TECO datasets may take a couple of days.
Be aware of the different action indexing of these two datasets when processing the data. The actions should be extracted as `npz_data["action"][:-1]` for DMLab and `npz_data["action"][1:]` for Minecraft.
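The two slicing rules can be sketched as a small helper. This is only an illustration of the indexing difference; `extract_actions` and the `dataset` flag are hypothetical names, not part of the released code:

```python
import numpy as np

def extract_actions(npz_path: str, dataset: str) -> np.ndarray:
    """Illustrative helper for the per-dataset action slicing."""
    data = np.load(npz_path)
    if dataset == "dmlab":
        # DMLab: drop the last entry so actions align with frame transitions
        return data["action"][:-1]
    if dataset == "minecraft":
        # Minecraft: drop the first entry instead
        return data["action"][1:]
    raise ValueError(f"unknown dataset: {dataset}")
```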
- Split the collected samples into short video clips of 7 frames (`n_context_frames + 1`), and organize them into folders according to the action index of the last frame transition. Take Minecraft with three action options as an example:
```
data/
|--minecraft/
   |--action_0/
   |  |--00000.mp4
   |  |--...
   |--action_1/
   |  |--00000.mp4
   |  |--...
   |--action_2/
      |--00000.mp4
      |--...
```
- Go through each action folder to infer the latent actions using the pretrained latent action encoder. This can be done by running `lam/test.sh` and setting `batch_size` in `lam/config/lam.yaml` to 1.
- Uncomment the `on_test_epoch_end` function in `lam/lam/model.py` to save the inferred latent actions as `latent_action_stats.pt`.
- In `MultiSourceSamplerDataset`, replace `VideoDataset` with `VideoDatasetDiscreteActionSpace`. Please check the parameter inputs and rename all paths if necessary.
- (Optional) Reset the learning rate of the pretrained weights by uncommenting the provided code under `configure_optimizers` in `worldmodel/vwm/models/diffusion.py`.
- Use the averaged latent actions as the action embeddings for the discrete action codebook of `ActionBook` in `worldmodel/vwm/modules/encoders/modules.py`. An example is provided in `__init__`.
- Run `worldmodel/run_adaptation_discrete.sh`.
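The averaging step above can be sketched as follows, assuming the saved latent action stats can be read as a mapping from action index to an `[N_i, D]` array of inferred latents (the exact saved layout may differ in your run; NumPy arrays stand in for the torch tensors used in the actual pipeline):

```python
import numpy as np

def build_action_codebook(stats: dict) -> np.ndarray:
    """Average the latents inferred for each action folder; the per-action
    means become the rows of the discrete action codebook."""
    dim = next(iter(stats.values())).shape[-1]
    codebook = np.zeros((len(stats), dim), dtype=np.float32)
    for action_idx, latents in stats.items():
        codebook[action_idx] = latents.mean(axis=0)
    return codebook
```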
- Split the collected samples into short video clips of 7 frames (`n_context_frames + 1`), and save the action values of the last frame transition in a TXT file with the same base name. Take nuScenes with a two-dimensional action displacement as an example:

The TXT files store a list containing the displacement `[x, y]` of each transition.
```
data/
|--nuscenes/
   |--00000.mp4
   |--00000.txt
   |--00001.mp4
   |--00001.txt
   |--...
```
- Go through all video clips to infer their latent actions using the pretrained latent action encoder. This can be done by running `lam/test.sh` and setting `batch_size` in `lam/config/lam.yaml` to 1.
- Uncomment the `on_test_epoch_end` function in `lam/lam/model.py` to save the inferred latent actions as `latent_action_stats.pt`.
- In `MultiSourceSamplerDataset`, replace `VideoDataset` with `VideoDatasetContinuousActionSpace`. Please check the parameter inputs and rename all paths if necessary.
- (Optional) Reset the learning rate of the pretrained weights by uncommenting the provided code under `configure_optimizers` in `worldmodel/vwm/models/diffusion.py`.
- Convert the ground truth of all actions to `raw_action_inputs.pt`, ensuring it corresponds to the order of `latent_action_stats.pt`.
- Use `worldmodel/fast_init_mlp.py` to optimize the initialization weights `mlp_init_weights.pth` for `ActionMLP` in `worldmodel/vwm/modules/encoders/modules.py`. An example is provided in `__init__`.
- Run `worldmodel/run_adaptation_continuous.sh`.
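The ordering requirement between `raw_action_inputs.pt` and `latent_action_stats.pt` can be sketched like this, assuming the inference pass visits clips in sorted file-name order and each TXT file holds one `[x, y]` displacement (verify both assumptions against your run; `collect_raw_actions` is an illustrative name):

```python
import os

def collect_raw_actions(data_dir: str) -> list:
    """Read per-clip displacement TXT files in sorted name order, so that
    row i here matches row i of the inferred latent actions."""
    rows = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(data_dir, name)) as f:
            # Tolerate either "x y" or "[x, y]" formatting
            text = f.read().replace("[", " ").replace("]", " ").replace(",", " ")
        x, y = (float(v) for v in text.split())
        rows.append([x, y])
    return rows

# In the actual pipeline the result would be saved with torch, e.g.:
# torch.save(torch.tensor(collect_raw_actions("data/nuscenes")), "raw_action_inputs.pt")
```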
To visualize the UMAP projection of latent actions as in our paper, please refer to UMAP and set `n_neighbors` and `min_dist` to 15 and 0.5, respectively.
<= Previous: [Action Transfer]
=> Next: [Visual Planning]