fly-craft-examples
Demonstration generation and training scripts for fly-craft.
Research Papers Based on This Repository
1. Improving the Continuity of Goal-Achievement Ability via Policy Self-Regularization for Goal-Conditioned Reinforcement Learning [ICML 2025]
This research proposed a margin-based policy self-regularization approach to improve the continuity of goal-achievement ability for goal-conditioned reinforcement learning (paper link). Please refer to train_scripts/MSR for the training scripts.
2. VVC-Gym: A Fixed-Wing UAV Reinforcement Learning Environment for Multi-Goal Long-Horizon Problems [ICLR 2025]
This research provided a novel fixed-wing UAV RL environment, demonstrations, and baselines for multi-goal long-horizon problem research (paper link). Please refer to train_scripts/VVCGym for the training scripts.
3. Iterative Regularized Policy Optimization with Imperfect Demonstrations [ICML 2024]
This research proposed Iterative Regularized Policy Optimization to solve the over-constrained exploration problem and the primacy bias problem in offline-to-online learning (paper link). Please refer to train_scripts/IRPO for the training scripts.
Generating Demonstrations
1. Generating with a PID controller
Sample from $V \times M \times X = [v_{min}:v_{max}:v_{interval}] \times [\mu_{min}:\mu_{max}:\mu_{interval}] \times [\chi_{min}:\chi_{max}:\chi_{interval}]$ with a PID controller and save the sampled trajectories in demonstrations/data/{step-frequence}hz_{$v_{interval}$}_{$\mu_{interval}$}_{$\chi_{interval}$}_{data-dir-suffix}.
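The goal grid can be enumerated directly; the following is a minimal sketch of the sampling grid (the bounds and intervals below are hypothetical examples, and the actual generation script and its arguments live in the repo):

import numpy as np

# Hypothetical bounds and intervals for the goal grid (v in m/s, mu and chi in degrees).
v_range = np.arange(100, 301, 10)     # [v_min : v_max : v_interval]
mu_range = np.arange(-85, 86, 5)      # [mu_min : mu_max : mu_interval]
chi_range = np.arange(-170, 171, 5)   # [chi_min : chi_max : chi_interval]

# Cartesian product V x M x X: every (v, mu, chi) goal the PID controller is asked to track.
goals = np.array(np.meshgrid(v_range, mu_range, chi_range, indexing="ij")).reshape(3, -1).T
print(goals.shape)  # (len(v_range) * len(mu_range) * len(chi_range), 3)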
2. Updating demonstrations with a trained RL policy
Update the demonstrations in {demos-dir} with the policy in {policy-ckpt-dir}.
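A minimal sketch of the idea, assuming a Stable-Baselines3 SAC checkpoint and the fly-craft Gymnasium environment (the environment id, checkpoint layout, and update criterion are assumptions; the repo's script may differ):

import gymnasium as gym
import flycraft  # registers the fly-craft environment (package name assumed)
from stable_baselines3 import SAC

model = SAC.load("{policy-ckpt-dir}/best_model")  # placeholder checkpoint path
env = gym.make("FlyCraft-v0")                     # environment id assumed

obs, info = env.reset()
trajectory, done = [], False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action, reward))
    obs, done = next_obs, terminated or truncated
# A rollout that reaches the goal (e.g. faster than the PID demonstration) could then
# replace the corresponding trajectory in {demos-dir}.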
3. Augmenting demonstrations
Augment trajectories based on the symmetry of $\chi$.
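A minimal sketch of the symmetry-based augmentation, assuming hypothetical CSV column names for the $\chi$-related quantities (the actual schema and sign conventions are defined by the repo's script):

import pandas as pd

# Hypothetical column names; these are the quantities whose sign flips when a
# trajectory is mirrored from goal (v, mu, chi) to goal (v, mu, -chi).
MIRRORED_COLUMNS = ["chi", "target_chi", "roll", "ail"]

def mirror_trajectory(df: pd.DataFrame) -> pd.DataFrame:
    """Return the mirror-image trajectory obtained by negating the chi-related columns."""
    mirrored = df.copy()
    for col in MIRRORED_COLUMNS:
        if col in mirrored.columns:
            mirrored[col] = -mirrored[col]
    return mirrored

traj = pd.read_csv("{demos-dir}/traj_v200_mu10_chi30.csv")  # placeholder filename
mirror_trajectory(traj).to_csv("{demos-dir}/traj_v200_mu10_chi-30.csv", index=False)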
4. Labeling demonstrations with rewards (support for offline RL)
Label the demonstrations in {demos-dir} with rewards (--traj-prefix is the prefix of the CSV filenames in the demonstration directory).
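A minimal sketch of the labelling step, assuming a hypothetical dense reward on the goal-tracking error (the repo's reward definition and CSV schema may differ):

import glob
import pandas as pd

def label_rewards(csv_path: str) -> None:
    df = pd.read_csv(csv_path)
    # Hypothetical dense reward: negative absolute tracking error on the three goal components.
    df["reward"] = -(
        (df["v"] - df["target_v"]).abs()
        + (df["mu"] - df["target_mu"]).abs()
        + (df["chi"] - df["target_chi"]).abs()
    )
    df.to_csv(csv_path, index=False)

# --traj-prefix selects which CSV files in {demos-dir} get labelled (the prefix here is a placeholder).
for path in glob.glob("{demos-dir}/traj_*.csv"):
    label_rewards(path)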
5. Processing demonstrations (normalize observations and actions, and concatenate all CSV files) and caching the processed np.ndarray objects
Note: The cache directory should be consistent with the "data_cache_dir" in the training configurations.
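A minimal sketch of the processing step, assuming min-max normalization and hypothetical column names; the cache path must match "data_cache_dir" in the training configs:

import glob
import numpy as np
import pandas as pd

OBS_COLUMNS = ["v", "mu", "chi", "p", "h"]   # hypothetical observation columns
ACT_COLUMNS = ["ail", "ele", "rud", "pla"]   # hypothetical action columns

# Concatenate all demonstration CSV files into one table.
frames = [pd.read_csv(p) for p in sorted(glob.glob("{demos-dir}/*.csv"))]
data = pd.concat(frames, ignore_index=True)

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0) + 1e-8)

obs = min_max_normalize(data[OBS_COLUMNS].to_numpy(dtype=np.float32))
acts = min_max_normalize(data[ACT_COLUMNS].to_numpy(dtype=np.float32))

# Cache the processed arrays; this directory must match "data_cache_dir" in the training configuration.
np.save("{cache-dir}/observations.npy", obs)
np.save("{cache-dir}/actions.npy", acts)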
Training policies with Stable-Baselines3
1. Behavioral Cloning (BC)
Note: This repo depends on imitation (version 1.0.0). There is a bug in the behavioral cloning (BC) algorithm of this version. Before running BC-related algorithms, it is necessary to modify line 494 of algorithms/bc.py in the imitation library from:
to:
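Separately from that one-line patch, a BC run with the imitation API looks roughly like the sketch below, reusing the arrays cached in step 5 of the previous section (the spaces, paths, and hyper-parameters are assumptions, not the repo's training script):

import numpy as np
from gymnasium import spaces
from imitation.algorithms import bc
from imitation.data.types import Transitions

# Load the cached demonstration arrays (paths are placeholders).
obs = np.load("{cache-dir}/observations.npy")
acts = np.load("{cache-dir}/actions.npy")

# Hypothetical flat Box spaces matching the normalized, cached arrays.
observation_space = spaces.Box(low=0.0, high=1.0, shape=obs.shape[1:], dtype=np.float32)
action_space = spaces.Box(low=0.0, high=1.0, shape=acts.shape[1:], dtype=np.float32)

# For brevity the concatenated data is treated as a single stream of transitions.
transitions = Transitions(
    obs=obs[:-1],
    acts=acts[:-1],
    next_obs=obs[1:],
    dones=np.zeros(len(obs) - 1, dtype=bool),
    infos=np.array([{}] * (len(obs) - 1)),
)

bc_trainer = bc.BC(
    observation_space=observation_space,
    action_space=action_space,
    demonstrations=transitions,
    rng=np.random.default_rng(0),
)
bc_trainer.train(n_epochs=10)
bc_trainer.policy.save("bc_policy.zip")  # hypothetical output path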
2. Proximal Policy Optimization (PPO)
3. PPO fine-tuning a BC-pre-trained policy
4. Soft Actor-Critic (SAC)
5. SAC with Hindsight Experience Replay (HER)
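A minimal Stable-Baselines3 sketch of SAC with HER on the goal-conditioned fly-craft environment (the environment id and hyper-parameters are assumptions; the actual runs are driven by the JSON config files):

import gymnasium as gym
import flycraft  # registers the fly-craft environment (package name assumed)
from stable_baselines3 import SAC, HerReplayBuffer

env = gym.make("FlyCraft-v0")  # goal-conditioned environment; id assumed
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("sac_her_flycraft")  # hypothetical output path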
6. Non-Markovian Reward Problem (NMR)
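The NMR variants reuse the SAC+HER entry point with the corresponding config files:

# test SAC on NMR (last 10 observations)
python train_scripts/train_with_rl_sac_her.py --config-file-name configs/train/sac/easy_her_sparse_negative_non_markov_reward_persist_1_sec/sac_config_10hz_128_128_1.json
# test SAC on NMR (last 20 observations)
python train_scripts/train_with_rl_sac_her.py --config-file-name configs/train/sac/easy_her_sparse_negative_non_markov_reward_persist_2_sec/sac_config_10hz_128_128_1.json
# test SAC on NMR (last 30 observations)
python train_scripts/train_with_rl_sac_her.py --config-file-name configs/train/sac/easy_her_sparse_negative_non_markov_reward_persist_3_sec/sac_config_10hz_128_128_1.json
# try to solve NMR with frame stacking
python train_scripts/train_with_rl_sac_her.py --config-file-name configs/train/sac/hard_her_framestack_sparse_negative_non_markov_reward_persist_1_sec/sac_config_10hz_128_128_1.json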
Evaluating policies
1. Visualization
The script train_scripts/IRPO/evaluate/rollout_one_trajectory.py generates .acmi files, which can be visualized with Tacview to inspect the flight trajectory.
2. Statistical Evaluation
The script train_scripts/IRPO/evaluate/evaluate_policy_by_success_rate.py evaluates trained policies statistically, reporting the success rate, cumulative reward, and trajectory length of the policy.
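A minimal sketch of such a statistical evaluation, assuming the environment reports success via info["is_success"] (the repo's script may use different criteria and entry points):

import numpy as np
import gymnasium as gym
import flycraft  # registers the fly-craft environment (package name assumed)
from stable_baselines3 import SAC

model = SAC.load("{policy-ckpt-dir}/best_model")  # placeholder checkpoint path
env = gym.make("FlyCraft-v0")                     # environment id assumed

successes, returns, lengths = [], [], []
for _ in range(100):  # number of evaluation episodes is arbitrary here
    obs, info = env.reset()
    done, ep_return, ep_len = False, 0.0, 0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        ep_return += reward
        ep_len += 1
        done = terminated or truncated
    successes.append(float(info.get("is_success", False)))
    returns.append(ep_return)
    lengths.append(ep_len)

print(f"success rate: {np.mean(successes):.2f}, "
      f"mean return: {np.mean(returns):.2f}, mean length: {np.mean(lengths):.1f}")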
Citation
Cite as