OmniRL: Omni Reward and Loss Scheme for Vision-Language R1 Model Training
Following the success of DeepSeek R1, many researchers have set out to reproduce it. However, using the open-source multimodal R1 frameworks directly often runs into various issues. We therefore developed OmniRL, a framework based on VLM-R1 that supports training on pure text, text-image, multi-image, and mixed text-only and image-text data.
GRPO uses only two simple rule-based reward functions: a format reward and an accuracy reward. The implementation of the accuracy reward varies by task. For tasks that select from fixed answers, it is enough to check whether the model’s output matches the ground truth. For LeetCode problems, a compiler is needed to assess correctness. For the majority of generative tasks, however, the quality of model-generated outputs can be evaluated with three metrics: BLEU, ROUGE, and METEOR. We therefore propose using these metrics directly as reward functions: BLEU for local matching degree (Rb), ROUGE for critical information coverage (Rr), and METEOR for semantic coherence (Rm). Together they form a unified accuracy reward function applicable to reinforcement learning across various tasks.
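As a rough illustration of how such a unified accuracy reward could be assembled, the sketch below combines three simplified, pure-Python stand-ins for the metrics. It is not OmniRL's implementation: the function names, the equal weights, and the whitespace tokenization are assumptions, and the stand-ins omit parts of the real metrics (BLEU's brevity penalty and multi-order n-grams, METEOR's stemming, synonym matching, and fragmentation penalty).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision: a simplified stand-in for BLEU (Rb)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall(candidate, reference):
    """LCS-based recall: a simplified stand-in for ROUGE-L (Rr)."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)


def unigram_f_mean(candidate, reference, alpha=0.9):
    """Recall-weighted unigram F-mean: a simplified stand-in for METEOR (Rm)."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


def accuracy_reward(candidate_text, reference_text, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three stand-in metrics; the weights are hypothetical."""
    cand = candidate_text.lower().split()
    ref = reference_text.lower().split()
    r_b = ngram_precision(cand, ref)
    r_r = lcs_recall(cand, ref)
    r_m = unigram_f_mean(cand, ref)
    return weights[0] * r_b + weights[1] * r_r + weights[2] * r_m


if __name__ == "__main__":
    print(accuracy_reward("a cat sat", "a cat sat on the mat"))
```

Each component lies in [0, 1], so with weights that sum to 1 the combined reward stays in [0, 1] as well. In practice, library implementations of BLEU, ROUGE, and METEOR would replace the stand-ins.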
We utilize generative evaluation metrics such as BLEU and ROUGE as accuracy reward functions, which can be adapted to most tasks.
Training Skills
There are three main types of reward function: length reward, accuracy reward, and format reward. Their values should be on similar scales. For example, if the length reward exceeds 500 while the accuracy reward lies between 0 and 1, the length term dominates the total reward and accuracy converges much more slowly.
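The effect of mismatched scales can be seen in a small sketch of group-normalized advantages, the quantity GRPO derives from the rewards of a group of sampled responses. The helper names and the `length_scale` parameter below are hypothetical, chosen only for illustration, and the numbers are made up.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def total_rewards(length_rewards, accuracy_rewards, format_rewards, length_scale=1.0):
    """Sum the three rewards, dividing the length reward by a hypothetical scale."""
    return [
        length / length_scale + accuracy + fmt
        for length, accuracy, fmt in zip(length_rewards, accuracy_rewards, format_rewards)
    ]


# Two sampled responses: the first is wrong but slightly longer,
# the second is correct. Both follow the required format.
length_rewards = [520.0, 500.0]
accuracy_rewards = [0.0, 1.0]
format_rewards = [1.0, 1.0]

# Unscaled: the length term dominates, so the wrong answer gets the higher advantage.
unscaled = group_advantages(total_rewards(length_rewards, accuracy_rewards, format_rewards))

# Scaled: with the length reward brought into a comparable range,
# the correct answer gets the higher advantage.
scaled = group_advantages(
    total_rewards(length_rewards, accuracy_rewards, format_rewards, length_scale=1000.0)
)

if __name__ == "__main__":
    print("unscaled:", unscaled)
    print("scaled:  ", scaled)
```

Dividing by a constant is only one option; clipping or min-max normalization of the length reward would serve the same purpose of keeping all three terms in a comparable range.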
Learning the format is quite straightforward and converges quickly, so it doesn’t require much effort.
The parameter “max_output_token_length” sets the maximum number of tokens the model can output. When responses are long, increase this value; otherwise the response is truncated before it can complete the required format, and the format reward stays at 0.
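The truncation problem can be reproduced with a minimal format check. The sketch below assumes an R1-style template of the form `<think>...</think><answer>...</answer>`, which is an assumption rather than something stated above, and it approximates token truncation by cutting on whitespace instead of using a real tokenizer. The names `format_reward` and `truncate` are hypothetical.

```python
import re

# Assumed R1-style template: a think block followed by an answer block.
FORMAT_PATTERN = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def format_reward(response):
    """Return 1.0 if the whole response matches the assumed template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(response) else 0.0


def truncate(response, max_output_token_length):
    """Crude stand-in for generation truncation: keep the first N whitespace tokens."""
    return " ".join(response.split()[:max_output_token_length])


response = (
    "<think> The box is in the upper left corner of the image. </think> "
    "<answer> [10, 20, 110, 220] </answer>"
)

if __name__ == "__main__":
    # A generous limit leaves the response intact, so the format is satisfied.
    print(format_reward(truncate(response, 64)))
    # A tight limit cuts the response off before the closing tags appear.
    print(format_reward(truncate(response, 8)))
```

With the tight limit, the closing `</think>` and the whole answer block are never generated, so no amount of training can raise the format reward until the limit is increased.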
Our Team
We are the Multimodal Algorithm Team of AIDC. If you are looking for a job, please feel free to contact us: yilei.yi@alibaba-inc.com
If you find this project useful, please cite us:
@misc{long2025OmniRL,
author = {Long, Rujiao and Jin, Ziyu and Wang, Zhan and Huang, Zijin and Cheng, Qiannan and Yi, Lei},
title = {OmniRL: Omni Reward and Loss Scheme for Vision-Language R1 Model Training},
howpublished = {\url{https://github.com/alibaba/OmniRL}},
note = {Accessed: 2025-03-18},
year = {2025}
}
Acknowledgements
We would like to express our sincere gratitude to VLM-R1, DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, RefCOCO, and RefGTA for providing open-source resources that contributed to the development of this project.