This repository contains the code for the paper *Robust Imitation of a Few Demonstrations with a Backwards Model*.
- Python 3.7+ (we used 3.9)
- MuJoCo 2.1.0 (see https://github.com/openai/mujoco-py for installation instructions)
- See `requirements.txt` for the full list of dependencies

To install:

```bash
pip install -r requirements.txt
pip install -e .
```

The requirements include additional libraries such as mujoco-maze and d4rl for the Maze and Adroit environments, respectively.
If these packages fail to install, clone the repositories directly and run `pip install <path to repo>`.
mujoco-maze is forked from https://github.com/kngwyu/mujoco-maze and edited to include the custom environments and their goal-oriented versions. See its README.md for more details.
We use Weights and Biases to track experiment results and Hydra for configuration management.
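Hydra composes a config from YAML files plus `key=value` command-line overrides (the `+experiments=...` arguments used by the run commands later in this README use this mechanism). Purely as a toy illustration, not Hydra's actual implementation (its real override grammar is much richer, with config groups, `+` additions, and typed values), an override can be thought of as a dotted-path assignment into a nested config:

```python
def apply_override(config, override):
    """Toy sketch of a Hydra-style 'a.b=value' override: walk the dotted
    path, creating nested dicts as needed, and set the leaf value.
    (Hydra itself handles typing, config groups, interpolation, etc.)"""
    key, value = override.split("=", 1)
    node = config
    *parents, leaf = key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config

cfg = {"seed": "0", "model": {"lr": "3e-4"}}
apply_override(cfg, "model.lr=1e-3")
apply_override(cfg, "seed=1")
# cfg is now {"seed": "1", "model": {"lr": "1e-3"}}
```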
For the Fetch environments, we use the RL Baselines3 Zoo repository. We use the same hyperparameter settings as in that repo and train HER+TQC to convergence on `gym==0.21.0`.
A few small modifications to the rl-baselines3-zoo repo are needed. Copy `scripts/fetch/gen_demos.py` and `scripts/fetch/gen_demos.sh` to the top-level folder containing the rl-baselines3-zoo code. Place the custom environments for Push-v2 and PickAndPlace-v2 inside the rl-baselines3-zoo repo and modify the hyperparameter config file (see the rl-baselines3-zoo README for details). Expert policies can then be trained with the `train.py` script. To generate demonstrations, run `save_demos.sh` inside the rl-baselines3-zoo folder.
For the Maze environments, we use goal-oriented SAC [1], where the goal is given as part of the observation. We use the GoalEnv interface provided in gym.
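In gym's GoalEnv interface, each observation is a dict with `observation`, `achieved_goal`, and `desired_goal` keys. As a rough sketch only (the function names, dimensions, and distance threshold below are illustrative assumptions, not this repo's code), a goal-oriented policy input and a sparse goal-reaching reward can be written as:

```python
import numpy as np

def flatten_goal_obs(obs_dict):
    """Concatenate the raw observation with the desired goal so a
    standard SAC policy network can consume the goal as part of its input."""
    return np.concatenate([obs_dict["observation"], obs_dict["desired_goal"]])

def sparse_goal_reward(achieved_goal, desired_goal, threshold=0.5):
    """Sparse reward: 0 when the achieved goal is within `threshold` of the
    desired goal, else -1 (a common convention in gym's goal environments;
    the threshold here is an arbitrary illustrative value)."""
    dist = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if dist <= threshold else -1.0

obs = {
    "observation": np.array([0.1, 0.2, 0.0]),
    "achieved_goal": np.array([0.1, 0.2]),
    "desired_goal": np.array([1.0, 1.0]),
}
policy_input = flatten_goal_obs(obs)  # shape (5,)
```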
To train the expert policy, run the provided script:
```bash
bash scripts/maze/sac_expert.sh
```

Change the `env_id` variable to the desired environment ID. The prefix `Goal` indicates that the environment is goal-oriented.
The specific environment IDs are:
- GoalPointRegionUMaze-v2
- GoalPointRoom5x11-v1
- GoalPointCorridor7x7-v2
- GoalAntRegionUMaze-v2
- GoalAntRoom5x11-v1
- GoalAntCorridor7x7-v2
To generate demonstrations, see `scripts/maze/gen_demos.py` and `scripts/maze/gen_demos.sh`.
For the Adroit environment, we use the pre-trained policy checkpoint provided by DAPG [2]. To generate demonstrations, see `scripts/adroit/gen_demos.py` and `scripts/adroit/gen_demos.sh`.
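The `gen_demos.py` scripts roll out the expert and save the resulting trajectories; the actual on-disk format is defined in those scripts. Purely as an illustrative sketch (the function names and pickle format here are assumptions, not the repo's format), demonstrations could be stored as lists of (observation, action) pairs:

```python
import os
import pickle
import tempfile

def save_demos(trajectories, path):
    """Save a list of trajectories, each a list of (observation, action)
    pairs. Illustrative only: the repo's gen_demos.py scripts define
    the actual demonstration format."""
    with open(path, "wb") as f:
        pickle.dump(trajectories, f)

def load_demos(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# One tiny two-step trajectory, round-tripped through disk.
demo = [([0.0, 0.1], [1.0]), ([0.1, 0.2], [0.5])]
path = os.path.join(tempfile.mkdtemp(), "demos.pkl")
save_demos([demo], path)
assert load_demos(path) == [demo]
```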
To train and evaluate BMIL, run the following command:
```bash
python experiments/bmil.py +experiments=bmil/fetch_pick
```

A run script is provided in `scripts/fetch/run.sh`. Change the `TASK` and `METHOD` variables accordingly.
To train and evaluate BMIL, run the following command:
```bash
python experiments/bmil.py +experiments=bmil/maze_point5x11
```

A run script is provided in `scripts/maze/run.sh`. Change the `TASK` and `METHOD` variables accordingly.
To train and evaluate BMIL, run the following command:
```bash
python experiments/bmil.py +experiments=bmil/adroit_relocate
```

A run script is provided in `scripts/adroit/run.sh`. Change the `METHOD` variable accordingly.
[1] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018.
[2] Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations, Rajeswaran et al, 2018.
If you find this work useful, please cite the paper as follows:
```bibtex
@inproceedings{park2022bmil,
  title={Robust Imitation of a Few Demonstrations with a Backwards Model},
  author={Park, Jung Yeon and Wong, Lawson L.S.},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022},
  url={https://arxiv.org/abs/2210.09337}
}
```