We introduce our first-generation reasoning model, Tiny-R1-32B-Preview, which outperforms the 70B model Deepseek-R1-Distill-Llama-70B on math (AIME 2024) and coding (LiveCodeBench) and nearly matches the full R1 model in math (78.1 vs. 79.8 on AIME 2024).
We applied supervised fine-tuning (SFT) to Deepseek-R1-Distill-Qwen-32B in three target domains (Mathematics, Code, and Science) using the 360-LLaMA-Factory training framework, producing three domain-specific models. Questions from open-source data served as seeds, and the responses for the mathematics, coding, and science tasks were generated by R1. We then used the Mergekit tool from the Arcee team to combine the three models into Tiny-R1-32B-Preview, which demonstrates strong overall performance. For more technical details, please refer to our technical report.
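The data construction described above (seed questions paired with teacher-generated responses) can be sketched roughly as follows. This is only an illustration: the `messages` record layout, the `build_sft_records` helper, and the `fake_teacher` stand-in for R1 are assumptions, not the format or code actually used with 360-LLaMA-Factory.

```python
from typing import Callable, Dict, List

Record = Dict[str, List[Dict[str, str]]]


def build_sft_records(
    seed_questions: List[str],
    generate_response: Callable[[str], str],
) -> List[Record]:
    """Pair each seed question with a teacher response in a chat-style record."""
    records: List[Record] = []
    for question in seed_questions:
        answer = generate_response(question)  # hypothetical call to the teacher model
        records.append(
            {
                "messages": [
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            }
        )
    return records


def fake_teacher(question: str) -> str:
    """Placeholder for the teacher model; returns a canned string."""
    return f"<think>working on: {question}</think> placeholder answer"


records = build_sft_records(["What is 2 + 2?"], fake_teacher)
```

One such dataset would be built per domain, and each would be used to fine-tune a separate copy of the base model.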
Evaluation
| Model | Math (AIME 2024) | Coding (LiveCodeBench) | Science (GPQA-Diamond) |
| --- | --- | --- | --- |
| Deepseek-R1-Distill-Qwen-32B | 72.6 | 57.2 | 62.1 |
| Deepseek-R1-Distill-Llama-70B | 70.0 | 57.5 | 65.2 |
| Deepseek-R1 | 79.8 | 65.9 | 71.5 |
| Tiny-R1-32B-Preview (Ours) | 78.1 | 61.6 | 65.0 |
All scores are reported as pass@1.
For AIME 2024 we sample 16 responses per problem and for GPQA-Diamond we sample 4, and we report the accuracy averaged over all samples for a more stable evaluation.
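As a rough sketch of this scoring scheme, pass@1 can be estimated by averaging correctness over the sampled responses for each problem and then averaging across problems. The function name and the toy correctness matrix below are hypothetical, and the exact aggregation in the evaluation scripts may differ.

```python
from typing import List


def mean_pass_at_1(correct: List[List[bool]]) -> float:
    """Return average pass@1 in percent.

    correct[i][j] is True if sample j for problem i was judged correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy example: 3 problems, 4 samples each (as with GPQA-Diamond sampling).
toy = [
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
score = mean_pass_at_1(toy)
print(f"{score:.2f}")
```

Averaging over several samples reduces the variance that a single sampled response would introduce at temperature 0.6.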
We merged the three separately trained domain models into a single model. The comparison results are shown below.
| Model | Math (AIME 2024) | Coding (LiveCodeBench) | Science (GPQA-Diamond) |
| --- | --- | --- | --- |
| Math-Model | 73.1 | - | - |
| Code-Model | - | 63.4 | - |
| Science-Model | - | - | 64.5 |
| Merged-Model (Tiny-R1-32B-Preview) | 78.1 | 61.6 | 65.0 |
Getting Started
Branch Train
For multi-node training, please first fill in the train/hostfile file. For single-node training, this step is not required.
Note
About hostfile: Each line in the hostfile specifies a node, formatted as <hostname> slots=<num_slots>, where <hostname> is the name of the node and <num_slots> is the number of GPUs available on that node. Here is an example:
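For illustration, a hostfile in this format could be generated with a small helper like the one below. The node names `node-0` and `node-1`, the slot counts, and the `render_hostfile` function are hypothetical; substitute the hostnames and GPU counts of your own cluster.

```python
from typing import Dict


def render_hostfile(nodes: Dict[str, int]) -> str:
    """Render one '<hostname> slots=<num_slots>' line per node."""
    lines = []
    for hostname, num_slots in nodes.items():
        if num_slots < 1:
            raise ValueError(f"{hostname}: slots must be a positive integer")
        lines.append(f"{hostname} slots={num_slots}")
    return "\n".join(lines) + "\n"


# Two hypothetical nodes with 8 GPUs each.
hostfile_text = render_hostfile({"node-0": 8, "node-1": 8})
print(hostfile_text, end="")
```

The printed text can be saved as train/hostfile before launching multi-node training.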
Merge
Installation
git clone https://github.com/TinyR1-32B-Preview.git
cd TinyR1-32B-Preview/mergekit/
pip install -e .
If you encounter the error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
you can resolve it by following these steps:
Update the package list and install the virtual environment package:
apt-get update -y
apt-get install python3-venv -y
Create and activate a virtual environment:
python3.10 -m venv eval
source eval/bin/activate
After activating the virtual environment, reinstall the required packages. This approach isolates your Python environment from the global packages, thereby preventing dependency conflicts.
To reproduce the merged qihoo360/TinyR1-32B-Preview model, use the script below:
sh sh/tinyr1_merge.sh [/path/to/math-model] [/path/to/science-model] [/path/to/code-model] [/path/to/output-model-dir]
The following parameters are mandatory:
[/path/to/math-model]: the path to the math domain model that has been fine-tuned via SFT.
[/path/to/science-model]: the path to the science domain model that has been fine-tuned via SFT.
[/path/to/code-model]: the path to the code domain model that has been fine-tuned via SFT.
[/path/to/output-model-dir]: the path where the fused model will be saved.
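To give an intuition for what combining several fine-tuned checkpoints into one means, the sketch below averages parameters that share a name across models. This is an assumption-laden illustration only: it is the simplest form of weight merging, it uses plain Python lists instead of tensors, and it is not claimed to be the algorithm that `sh/tinyr1_merge.sh` or Mergekit applies.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]


def average_state_dicts(
    models: Sequence[StateDict], weights: Sequence[float]
) -> StateDict:
    """Weighted average of same-named parameters (illustrative merge only)."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    keys = models[0].keys()
    for model in models[1:]:
        if model.keys() != keys:
            raise ValueError("all models must share the same parameter names")
    total = sum(weights)
    merged: StateDict = {}
    for name in keys:
        length = len(models[0][name])
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights)) / total
            for i in range(length)
        ]
    return merged


# Toy "checkpoints" with a single two-element parameter each (hypothetical values).
math_model = {"layer.weight": [1.0, 2.0]}
science_model = {"layer.weight": [3.0, 4.0]}
code_model = {"layer.weight": [5.0, 6.0]}
merged = average_state_dicts(
    [math_model, science_model, code_model], [1.0, 1.0, 1.0]
)
```

Merging of this kind only works when all checkpoints share one architecture, which holds here because the three domain models are fine-tuned from the same base model.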
Evaluation
We test the resulting models on three kinds of benchmarks: Math Reasoning, Code Reasoning, and Scientific Reasoning.
Math Reasoning: AIME24, AIME25
Scientific Reasoning: GPQA-Diamond
Code Reasoning: LiveCodeBench (2408-2502)
Math Reasoning
The evaluation code is modified from Qwen2.5-Math. In our evaluation, we set the temperature to 0.6, top-p to 0.95, and max_tokens to 32768. We provide an example for reproducing our results in math_evaluation.
The system prompt for evaluation is set to:
Please reason step by step, and put your final answer within \boxed{{}}.
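Because this prompt asks for the final answer inside `\boxed{...}`, a grader has to pull that span out of the response before comparing it with the reference. A minimal sketch of such an extraction is shown below; the `extract_last_boxed` helper is hypothetical and is not taken from the Qwen2.5-Math evaluation code.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


example = r"Adding the terms gives \boxed{\frac{1}{2}}."
print(extract_last_boxed(example))
```

Tracking brace depth rather than using a simple regular expression lets the helper handle nested LaTeX such as `\frac{1}{2}`.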
Scientific Reasoning
The evaluation code is modified from FuseO1-Preview. In our evaluation, we set the temperature to 0.6 and max_tokens to 32768. We provide an example for reproducing our results in science_evaluation.
The system prompt for evaluation is set to:
You are a helpful and harmless assistant. You should think step-by-step.
Code Reasoning
The evaluation code is modified from FuseO1-Preview. In our evaluation, we set the temperature to 0.6, top-p to 0.95, and max_tokens to 32768. We provide an example for reproducing our results in code_lcb_evaluation.
The system prompt for evaluation is set to:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
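Responses that follow this prompt can be split into their reasoning and answer parts with a small parser. The sketch below is one possible approach under the assumption that the model emits the tags exactly as written in the prompt; `ParsedResponse` and `parse_response` are hypothetical names, not part of the evaluation code.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    reasoning: Optional[str]
    answer: Optional[str]


# DOTALL lets the reasoning and answer span multiple lines.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Split a response into its <think> and <answer> parts, if present."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    return ParsedResponse(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )


sample = "<think>\nSort the list, then scan once.\n</think> <answer>print(sorted(xs))</answer>"
parsed = parse_response(sample)
```

Missing tags yield `None` rather than an exception, so a caller can decide how to score malformed outputs.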
Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qihoo360/TinyR1-32B-Preview"

# Load the model and tokenizer; dtype and device placement are selected automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Raw string, so LaTeX backslashes such as \boxed and \frac are not read as Python escapes.
prompt = r"Please reason step by step, and put your final answer within \boxed{}. Solve the integral: \[I = \int \frac{x^2}{(x+1)^3} \,dx\]"
messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template and append the generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4000
)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
| 📑 Paper | 🤗 Hugging Face | 🌐 Blog |
TinyR1 Team
Installation
To install the required dependencies, run:
Math Model SFT
Hint: Replace BASE_MODEL with the actual path to the base model, e.g., “/model/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B”.
Science Model SFT
Code Model SFT
Merge
Installation
To reproduce the merged qihoo360/TinyR1-32B-Preview model, use the sh/tinyr1_merge.sh script.
Citation