# TinyZero

> ⚠️ **Deprecation Notice**: This repo is no longer actively maintained. For running RL experiments, please use the latest veRL library directly. For the archived original documentation, see OLD_README.md.
TinyZero is a reproduction of DeepSeek R1 Zero on the countdown and multiplication tasks, built on top of veRL.

Through RL, the 3B base LM develops self-verification and search abilities all on its own.

You can experience the Aha moment yourself for < $30.

Twitter thread: https://x.com/jiayi_pirate/status/1882839370505621655

Full experiment log: https://wandb.ai/jiayipan/TinyZero

## Installation
```bash
conda create -n zero python=3.9

# install torch (or skip this step and let vllm install the correct version for you)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# install vllm
pip3 install vllm==0.6.3 # 0.5.4, 0.4.2, and 0.3.1 also work
pip3 install ray

# verl
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation

# quality of life
pip install wandb IPython matplotlib
```
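Before launching a run, it can save time to confirm the key packages are actually importable. This helper is not part of the repo, just a convenience sketch; note that `flash_attn` is the import name of the flash-attn package:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that are not importable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Packages installed above ("flash_attn" is flash-attn's import name).
print(missing_packages(["torch", "vllm", "ray", "flash_attn", "wandb"]))
```

An empty list means the environment is ready; any names printed need to be (re)installed.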
## Countdown task

### Data Preparation
```bash
conda activate zero
python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}
```
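A countdown instance pairs a target number with a small set of source numbers that can reach it. As a rough illustration of what the preprocessing produces — the field names and generation scheme here are assumptions for illustration, not the script's actual logic — a solvable instance can be built by combining random numbers with random operations:

```python
import random

def make_countdown_instance(num_operands=4, seed=None):
    """Generate one solvable countdown instance (illustrative scheme only):
    draw source numbers, combine them with + and -, and keep the result
    as the target if it lands in [1, 999]."""
    rng = random.Random(seed)
    while True:
        nums = [rng.randint(1, 99) for _ in range(num_operands)]
        target = nums[0]
        for n in nums[1:]:
            target = target + n if rng.random() < 0.5 else target - n
        if 1 <= target <= 999:
            # Solvable by construction: the ops we just applied reach the target.
            return {"target": target, "nums": nums}
```

Generating the target from the numbers (rather than the other way around) guarantees every instance has at least one solution.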
### Run Training

```bash
conda activate zero
```

For the following scripts, if you run out of VRAM, try adding `critic.model.enable_gradient_checkpointing=True` to the script, and check out the discussion here.
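As a hedged sketch of where that override goes — assuming the training script invokes veRL's `verl.trainer.main_ppo` entrypoint, and with the other flags elided because they come from the repo's scripts — the option is appended alongside the existing Hydra-style overrides:

```bash
# Illustrative only: flag names other than the gradient-checkpointing
# override are assumptions about the script's veRL invocation.
python3 -m verl.trainer.main_ppo \
    data.train_files={path_to_your_dataset}/train.parquet \
    critic.model.enable_gradient_checkpointing=True
```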
**Single GPU**

Works for models <= 1.5B. For the Qwen2.5-0.5B base model, we found that it fails to learn reasoning.
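The "self-verification" the model learns is rewarded mechanically: a rule-based checker scores the proposed equation rather than another model judging it. A minimal sketch of such a checker — the function name and exact scoring are assumptions; the repo's actual reward code also scores output format — might look like:

```python
import ast
import operator

# Binary arithmetic operators the checker accepts.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate a pure-arithmetic AST (numbers and binary + - * / only)."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(equation, nums, target):
    """Return 1.0 if `equation` uses exactly the numbers in `nums` (each once)
    and evaluates to `target`, else 0.0. Sketch of a rule-based reward."""
    try:
        tree = ast.parse(equation, mode="eval")
        used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
        if used != sorted(nums):
            return 0.0
        return 1.0 if abs(_eval(tree) - target) < 1e-6 else 0.0
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
```

Because the reward is computable without any learned judge, malformed or wrong equations simply score 0.0, which is what makes cheap RL on this task possible.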
**3B+ model**

In this case, the base model is able to develop sophisticated reasoning skills.

### Instruct Ablation

We also experiment with Qwen2.5-3B-Instruct.

**Data Preparation**

To follow the chat template, we need to reprocess the data:

**Training**

## Acknowledgements

We build our project upon veRL.

## Citation

```bibtex
@misc{tinyzero,
  author       = {Jiayi Pan and Junjie Zhang and Xingyao Wang and Lifan Yuan and Hao Peng and Alane Suhr},
  title        = {TinyZero},
  howpublished = {https://github.com/Jiayi-Pan/TinyZero},
  note         = {Accessed: 2025-01-24},
  year         = {2025}
}
```