IRASim: Learning Interactive Real-Robot Action Simulators
[Project page] [Paper]
Fangqi Zhu1,2, Hongtao Wu1†*, Song Guo2*, Yuxiao Liu1, Chilam Cheang1, Tao Kong1
1ByteDance Research, 2Hong Kong University of Science and Technology
*Corresponding authors †Project Lead
https://github.com/user-attachments/assets/916034da-f0a7-40c2-8d98-c4c67760cf41
Scalable robot learning in the real world is limited by the cost and safety issues of real robots. In addition, rolling out robot trajectories in the real world can be time-consuming and labor-intensive. In this paper, we propose to learn an interactive real-robot action simulator as an alternative. We introduce a novel method, IRASim, which leverages the power of generative models to generate extremely realistic videos of a robot arm executing a given action trajectory, starting from a given initial frame. To validate the effectiveness of our method, we create a new benchmark, IRASim Benchmark, based on three real-robot datasets and perform extensive experiments on the benchmark. Results show that IRASim outperforms all the baseline methods and is preferred in human evaluations. We hope that IRASim can serve as an effective and scalable approach to enhance robot learning in the real world. To promote research on generative real-robot action simulators, we open-source code, benchmark, and checkpoints.
Installation
To set up the environment, run the following command:

bash scripts/install.sh
Dataset
To download the complete dataset, run:

bash scripts/download.sh
This table lists the download links and file sizes for the RT-1, Bridge, and Language-Table datasets, categorized into training data, evaluation data, and checkpoints.
The complete dataset structure can be found in dataset_structure.txt.
📢 Update (May 20, 2025)
We are excited to announce that the IRASim dataset is now available on Hugging Face:
🔗 https://huggingface.co/datasets/fangqi/IRASim
To reconstruct the full dataset locally:
1. Download all dataset parts from the Hugging Face page.
2. Use the provided merge.sh script to merge the downloaded files into multiple ZIP archives.
3. Extract each ZIP file separately to access the complete dataset.
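The reconstruction steps above can be sketched with a stdlib-only Python script. This is an illustration rather than the actual merge.sh, and the part-file naming (`*.part0`, `*.part1`, …) is an assumption:

```python
import zipfile
from pathlib import Path

def merge_parts(part_paths, out_zip):
    """Concatenate split archive parts (e.g. data.zip.part0, data.zip.part1, ...)
    back into a single ZIP file, in sorted order."""
    with open(out_zip, "wb") as out:
        for part in sorted(Path(p) for p in part_paths):
            out.write(part.read_bytes())

def extract_all(zip_dir, dest):
    """Extract every ZIP archive found in zip_dir into dest."""
    for zp in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zp) as zf:
            zf.extractall(dest)
```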
Language Table Application
We recommend starting with the Language Table application. This application provides a user-friendly keyboard interface to control the robotic arm in an initial image on a 2D plane:

python3 application/languagetable.py
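To see the shape of such a control loop, here is a minimal sketch of a keyboard-to-action mapping for 2D control. The keys, step size, and function names are hypothetical, not the application's actual bindings:

```python
# Hypothetical key bindings for 2D end-effector control (not the app's real ones).
KEY_TO_DELTA = {
    "w": (0.0, 0.01),   # move forward
    "s": (0.0, -0.01),  # move backward
    "a": (-0.01, 0.0),  # move left
    "d": (0.01, 0.0),   # move right
}

def step_position(pos, key):
    """Apply one keypress to an (x, y) position; unknown keys are a no-op."""
    dx, dy = KEY_TO_DELTA.get(key, (0.0, 0.0))
    return (pos[0] + dx, pos[1] + dy)
```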
Training
Below are example scripts for training the IRASim-Frame-Ada model on the RT-1 dataset.
To accelerate training, we recommend encoding videos into latent videos first. Our code also supports training directly on raw videos by setting pre_encode to false.
Single GPU Training
Multi-GPU Training on a Single Machine
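Whichever launch mode is used, the recommended pre-encoding amounts to a one-time caching pass over the dataset. Below is a minimal sketch of the idea only; `encode_fn` stands in for the actual video encoder and is not part of this repository's API:

```python
def pre_encode_dataset(videos, encode_fn, cache=None):
    """One-time pass that encodes each raw video into a latent and caches it,
    so later training epochs read latents instead of re-encoding frames.
    `videos` maps a video id to its frames; `encode_fn` is a placeholder for
    the real encoder."""
    cache = {} if cache is None else cache
    for name, frames in videos.items():
        if name not in cache:          # skip videos already encoded
            cache[name] = encode_fn(frames)
    return cache
```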
Evaluation
Below are example scripts for evaluating the IRASim-Frame-Ada model on the RT-1 dataset.
Short Trajectory Setting
To quantitatively evaluate the model in the short trajectory setting, we first need to generate all evaluation videos.
Generate evaluation videos:
We provide an automated script to calculate the metrics of the generated short videos:

python3 evaluate/evaluation_short_script.py
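For intuition, per-frame PSNR, one widely used video-quality metric, can be computed as below. This is a generic illustration; the exact metric set reported by the evaluation script may differ:

```python
import math

def psnr(frame_a, frame_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized frames,
    given as flat lists of pixel values. Higher is better; identical
    frames give infinity."""
    assert len(frame_a) == len(frame_b) and frame_a
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)
```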
Long Trajectory Setting
All long videos are generated in an autoregressive manner. First, generate the scripts for producing the long videos in a multi-process manner:

python3 scripts/generate_command.py
Run:

bash scripts/generate_long_video_rt1_frame_ada.sh
Use the automated script to calculate the metrics of the generated long videos:

python3 evaluate/evaluation_long_script.py
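The autoregressive rollout can be sketched as follows: split the action trajectory into chunks, generate a short clip per chunk, and condition each chunk on the last frame of the previous clip. `generate_clip` is a stand-in for the trained short-horizon model, not an actual function in this repository:

```python
def rollout_long_video(initial_frame, actions, generate_clip, chunk_len=16):
    """Autoregressive long-video generation: the action trajectory is split
    into chunks of chunk_len, and each chunk is rendered by a short-horizon
    model conditioned on the last frame generated so far.
    `generate_clip(frame, actions)` must return one frame per action."""
    frames = [initial_frame]
    for start in range(0, len(actions), chunk_len):
        chunk = actions[start:start + chunk_len]
        clip = generate_clip(frames[-1], chunk)
        frames.extend(clip)
    return frames
```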
Citation
If you find this code useful in your work, please consider citing:

@article{FangqiIRASim2024,
  title={IRASim: Learning Interactive Real-Robot Action Simulators},
  author={Fangqi Zhu and Hongtao Wu and Song Guo and Yuxiao Liu and Chilam Cheang and Tao Kong},
  journal={arXiv:2406.12802},
  year={2024}
}
Acknowledgement
Discussion Group
If you have any questions about installation, running, or deployment, or any ideas and suggestions for the project, feel free to join our WeChat group discussion!