This repository is the official implementation of the training code for DexWild. It includes the training pipeline both for single-data-source models and for cotraining across multiple data buffers (e.g., human and robot).
It is a modified version of the DiT-Policy repository. For a detailed description of the source codebase, please refer to The Ingredients for Robotic Diffusion Transformers.
Our repository is easy to install using Miniconda or Anaconda:

```bash
conda env create -f env.yml
conda activate dexwildtrain
pip install git+https://github.com/AGI-Labs/robobuf.git
pip install git+https://github.com/facebookresearch/r3m.git
pip install -e ./
pre-commit install # required for pushing back to the source git

# download the visual features
./download_features.sh
```

Next, set up the accelerate config.
```bash
accelerate config
# choose:
# This machine
# No distributed training
# No to CPU only
# No to torch dynamo
# No to DeepSpeed
# all GPUs
# Yes to NUMA efficiency
# bf16 mixed precision
```

To train policies, data must be in Robobuf format.
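The exact buffer schema is defined in the AGI-Labs/robobuf repository; as a rough, purely illustrative sketch (this is *not* the actual robobuf API), trajectory data can be thought of as a list of per-timestep transitions, each pairing an observation with an action:

```python
import pickle

# Hypothetical data layout (not the real robobuf format): one trajectory
# as a list of transitions, each with an observation dict and an action.
trajectory = [
    {
        "obs": {
            "cam0": b"<jpeg bytes>",   # placeholder for an encoded camera image
            "state": [0.0] * 9,        # e.g., a 9-dim proprio vector (matches task.obs_dim=9)
        },
        "action": [0.0] * 7,           # placeholder action vector
    }
    for _ in range(48)                 # one action chunk's worth of timesteps
]

# Serialize the trajectory to disk (illustrative only).
with open("buffer.pkl", "wb") as f:
    pickle.dump(trajectory, f)
```

Consult the robobuf repository for the real conversion utilities and on-disk format.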
An example shell script for training on a single data source is below.

```bash
accelerate launch finetune_accelerate.py --multirun exp_name=example_single_data_source batch_size=30 agent=transformer \
    agent.use_obs=add_token \
    agent.use_lang=false \
    task=umi_hand_fourcam_hybrid \
    task.train_buffer.hist_augment=false \
    task.train_buffer.img_hist_frame_indices="[0]" \
    task.train_buffer.state_hist_frame_indices="[0, 9, 18]" \
    agent/features=vit_base \
    task.obs_dim=9 \
    lr=0.0001 \
    ac_chunk=48 \
    trainer=bc_cos_sched \
    buffer_path=path_to_buffer \
    max_iterations=500000 \
    agent.features.restore_path=/home/$USER/dit-policy/visual_features/vit_base/SOUP_1M_DH.pth
```

An example shell script for cotraining is below.
```bash
accelerate launch finetune_accelerate.py --multirun exp_name=example_multi_data_source batch_size=30 agent=transformer_cotrain \
    agent.use_obs=add_token \
    agent.use_lang=false \
    task=cotrain_umi_hand_fourcam_hybrid \
    agent/features=vit_base \
    task.obs_dim=27 \
    lr=0.00015 \
    ac_chunk=48 \
    trainer=bc_cos_sched \
    buffer_paths_list="[human buffer path, robot buffer path]" \
    task.cotrain_weights="[0.6667,0.3333]" \
    max_iterations=500000 \
    agent.features.restore_path=/home/$USER/dit-policy/visual_features/vit_base/SOUP_1M_DH.pth
```

In each case, the main parameters that must be adjusted are listed below.
```bash
agent=transformer                       # training model: transformer or diffusion_unet
agent.use_obs=add_token                 # how observations are embedded (e.g., add_token for visual tokens)
agent.use_lang=false                    # whether to include language input (usually false for DexWild)
task=umi_hand_fourcam_hybrid            # task config defining modalities, frame history, and dataset structure
task.obs_dim=9                          # dimensionality of the observation vector (e.g., proprio + hand states)
lr=0.0001                               # learning rate for optimization
ac_chunk=48                             # number of timesteps the policy predicts per inference (action chunk size)
buffer_path=path_to_buffer              # path to a single Robobuf dataset

# If cotraining
agent=transformer_cotrain               # training model: transformer_cotrain or diffusion_unet_cotrain
buffer_paths_list="[human, robot]"      # list of Robobuf dataset paths (e.g., [human, robot])
task.cotrain_weights="[0.6667,0.3333]"  # weighting of each dataset for loss blending (should sum to 1.0)
```

If you want to change the inputs and outputs of the policies, the relevant config files can be found under:
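The repository's actual cotraining sampler may differ; as a minimal sketch of the idea behind `task.cotrain_weights` (the buffer names and contents below are hypothetical), each training batch can draw from the buffers in proportion to their weights, so human data is seen roughly twice as often as robot data with weights `[0.6667, 0.3333]`:

```python
import random

# Hypothetical buffers: in practice these would be Robobuf datasets.
buffers = {"human": ["h0", "h1", "h2"], "robot": ["r0", "r1"]}
weights = [0.6667, 0.3333]
assert abs(sum(weights) - 1.0) < 1e-3  # cotrain weights should sum to 1.0

def sample_batch(rng: random.Random, batch_size: int = 30) -> list[str]:
    """Pick a buffer per sample according to the weights, then a transition from it."""
    names = list(buffers)
    picked = rng.choices(names, weights=weights, k=batch_size)
    return [rng.choice(buffers[name]) for name in picked]

rng = random.Random(0)
batch = sample_batch(rng)  # a batch_size=30 mixture of human and robot samples
```

This is only an illustration of weighted dataset blending, not the repo's loss-weighting code.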
```bash
cd experiments/task
# yaml config files for cotraining
cotrain_dexwild_fourcam.yaml
cotrain_dexwild_twocam.yaml
# yaml config files for single data source
dexwild_fourcam.yaml
dexwild_twocam.yaml
```

You can easily download our pre-trained representations using the provided script: `./download_features.sh`. You may also download the features individually on our release website.
If you find this codebase or the diffusion transformer useful, please cite our works:
```bibtex
@article{tao2025dexwild,
  title   = {DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies},
  author  = {Tao, Tony and Srirama, Mohan Kumar and Liu, Jason Jingzhou and Shaw, Kenneth and Pathak, Deepak},
  journal = {Robotics: Science and Systems (RSS)},
  year    = {2025}
}

@article{dasari2024ditpi,
  title   = {The Ingredients for Robotic Diffusion Transformers},
  author  = {Dasari, Sudeep and Mees, Oier and Zhao, Sebastian and Srirama, Mohan Kumar and Levine, Sergey},
  journal = {arXiv preprint arXiv:2410.10088},
  year    = {2024}
}
```