IRPO
Official code for “Iterative Regularized Policy Optimization with Imperfect Demonstrations” (ICML 2024).
Prepare Python environment
Generate Demonstrations
Generate demonstrations with PID controller
Sample from the grid V×U×X = [vmin:vmax:vinterval] × [μmin:μmax:μinterval] × [χmin:χmax:χinterval] with a PID controller and save the sampled trajectories in demonstrations/data/{step-frequence}hz_{vinterval}_{μinterval}_{χinterval}_{data-dir-suffix}.
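
As a rough illustration of the sweep (not this repo's actual script: the bounds below, and the reading of v, μ, χ as airspeed, flight-path angle, and heading, are assumptions), the [min:max:interval] grid can be enumerated like so:

    import itertools

    import numpy as np

    # Illustrative bounds only; the real runs set these via arguments.
    v_min, v_max, v_interval = 100.0, 300.0, 50.0
    mu_min, mu_max, mu_interval = -30.0, 30.0, 10.0
    chi_min, chi_max, chi_interval = -180.0, 180.0, 30.0

    def sweep(lo, hi, step):
        """Evenly spaced targets, matching the [min:max:interval] notation."""
        return np.arange(lo, hi + step, step)

    for v, mu, chi in itertools.product(
        sweep(v_min, v_max, v_interval),
        sweep(mu_min, mu_max, mu_interval),
        sweep(chi_min, chi_max, chi_interval),
    ):
        # Each (v, mu, chi) target would be tracked by the PID controller and
        # the resulting trajectory written under demonstrations/data/.
        print(f"target: v={v}, mu={mu:+.0f}, chi={chi:+.0f}")
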
Update demonstrations with policy
Update the demonstrations in {demos-dir} using the policy loaded from {policy-ckpt-dir}.
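
One plausible reading of this step, sketched below under assumed names (the rollout helper, the demo dict layout, and the Gymnasium-style env API are assumptions, not this repo's code): re-run the current policy on each demonstrated goal and keep whichever trajectory scores better, so imperfect demonstrations are gradually replaced.

    def rollout(env, policy, goal):
        """Roll the current policy out on one goal (Gymnasium-style API)."""
        obs, _ = env.reset(options={"goal": goal})  # goal-passing is an assumption
        rewards, done = [], False
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            rewards.append(reward)
            done = terminated or truncated
        return {"goal": goal, "return": sum(rewards)}

    def update_demos(demos, policy, env):
        """Per goal, keep the better of the old demo and a fresh policy rollout."""
        updated = []
        for demo in demos:
            new = rollout(env, policy, demo["goal"])
            updated.append(new if new["return"] > demo["return"] else demo)
        return updated
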
Augment demonstrations
Augment trajectories based on χ's symmetry.
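
A minimal sketch of the idea, assuming χ is a heading-like angle whose dynamics are symmetric about zero, and that trajectories are stored as dicts of arrays; the column indices in the usage line are illustrative, not the repo's actual layout:

    import numpy as np

    def mirror_chi(traj, chi_idx, lateral_idx):
        """Reflect one trajectory about chi = 0 to double the dataset."""
        out = {k: np.array(v, copy=True) for k, v in traj.items()}
        out["obs"][:, chi_idx] *= -1.0      # mirror the heading angle
        out["act"][:, lateral_idx] *= -1.0  # mirror the lateral control channel
        return out

    # Usage (indices are placeholders):
    # augmented = demos + [mirror_chi(t, chi_idx=2, lateral_idx=0) for t in demos]
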
Train policies
Pre-train policy with Behavioral Cloning
Train a policy by Behavioral Cloning with the config file in {config-file-name}.
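
The BC objective itself is plain supervised regression of demonstrated actions on observations; a minimal PyTorch sketch, where the network, tensors, and hyperparameters are placeholders rather than this repo's config:

    import torch
    import torch.nn as nn

    def behavior_cloning(policy: nn.Module, obs: torch.Tensor, act: torch.Tensor,
                         epochs: int = 100, lr: float = 1e-3) -> nn.Module:
        """Fit the policy to demonstrated (obs, act) pairs by MSE regression."""
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(policy(obs), act)
            loss.backward()
            opt.step()
        return policy
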
Train policy with PPO
Train a policy by PPO with the config file in {config-file-name}.
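
The repo drives training from a config file; as a stand-in that only shows the shape of this step, the equivalent call with stable-baselines3's PPO would look like the following (env id and step budget are placeholders):

    from stable_baselines3 import PPO

    # Not this repo's trainer: a generic PPO run for illustration.
    model = PPO("MlpPolicy", "Pendulum-v1", verbose=1)
    model.learn(total_timesteps=100_000)
    model.save("ppo_policy")
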
Fine-tune a pre-trained policy with PPO
Fine-tune a pre-trained policy with PPO, using the config file in {config-file-name}.
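
A sketch of the warm-start, again with stable-baselines3 as a stand-in for the repo's trainer: load the BC-pretrained weights into PPO's policy network, then continue training. The checkpoint name and the matching network shapes are assumptions, and the paper's regularization toward demonstrations is not shown here.

    import torch
    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", "Pendulum-v1")
    bc_state = torch.load("bc_policy.pt")                 # assumed BC checkpoint
    model.policy.load_state_dict(bc_state, strict=False)  # copy matching layers
    model.learn(total_timesteps=100_000)
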
More experiments
Refer to exp_on_d4rl/ for experiments on HalfCheetah and Hopper.
Refer to exp_on_panda/ for experiments on Reach.

Citation

@inproceedings{gong2024iterative,
title={Iterative Regularized Policy Optimization with Imperfect Demonstrations},
author={Gong, Xudong and Feng, Dawei and Xu, Kele and Zhai, Yuanzhao and Yao, ChengKang and Wang, Weijia and Ding, Bo and Wang, Huaimin},
booktitle={International Conference on Machine Learning},
year={2024},
organization={PMLR}
}