SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Introduction
Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image-region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.
Method overview
Env Setup
If the installed trl version conflicts with our repository, replace it with the local copy. Some additional packages may also need to be installed.
Data
The dataset is available Here.
SIF-50K
The dataset can also be constructed with the Data Generation scripts below and is named SIF-50K. It includes SIF-50K-sampled-200.json for SFT and SIF-50K-sampled-200.json for RL. Please place them under the data/ folder.
Data Generation
We also provide scripts for reproducing the SIF-50K dataset. If you don't want to produce the dataset again, you can skip this section.
You can run the data generation with the following steps:
step1: Download VisCoT. Change the image folder and JSON URL at Lines 313 and 484 in data_generation/produce.py to where you downloaded VisCoT.
step2: Follow the instructions of DepthAnything to set up the environment (a depth-extraction sketch follows at the end of this section).
step3: Add your API key and LLM config at Lines 403, 489, and 490 in data_generation/produce.py.
step4: Run the data generation script.
cd data_generation
python produce.py
Remark: The same procedure applies to TallyQA; refer to data_generation/misc/produce_tallyqa.py.
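For step2, the depth model produces the depth maps behind the depth-enhanced bounding boxes. A minimal sketch, assuming Depth Anything V2's published checkpoint layout and infer_image interface (the actual preprocessing lives in data_generation/produce.py and may differ):
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

# Build the ViT-L variant and load its checkpoint (paths are assumptions).
model = DepthAnythingV2(encoder="vitl", features=256,
                        out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load("checkpoints/depth_anything_v2_vitl.pth",
                                 map_location="cpu"))
model.eval()

img = cv2.imread("example.jpg")
depth = model.infer_image(img)  # HxW relative depth map

# Summarize depth inside a candidate box, e.g. to tag the box with depth.
x1, y1, x2, y2 = 120, 80, 260, 210  # illustrative box
print("mean depth in box:", float(depth[y1:y2, x1:x2].mean()))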
Training
Warm up
You can follow LLaMA-Factory for environment setup and SFT training. Our hyperparameters and settings are included in the SFT folder. Specifically, you can use the settings under SFT/env to set up the environment.
GRPO-SIF
In GRPO-SIF, the key modification lies in the reward function used during training.
Taking Qwen2.5-VL as an example, the reward function is defined in: GRPO-SIF/src/open-r1-multimodal/src/open_r1/vlm_modules/qwen_module.py.
The progressive learning schedule is defined in GRPO-SIF/src/open-r1-multimodal/src/open_r1/trainer/grpo_trainer.py (a schematic sketch of such a schedule follows the training steps below).
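For intuition only, here is a schematic of what a grounding-aware reward of this kind can look like; the box format, weights, and parsing below are illustrative assumptions, not the actual code in qwen_module.py:
import re

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: max(0, r[2] - r[0]) * max(0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(completion, gt_box):
    # Format term: did the model emit any parseable [x1, y1, x2, y2] box?
    boxes = re.findall(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", completion)
    if not boxes:
        return 0.0
    # Score the final box, i.e. the model's refined focus region.
    pred = tuple(int(v) for v in boxes[-1])
    return 0.5 + 0.5 * iou(pred, gt_box)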
You can run the GRPO-SIF training with the following steps:
step1: Add your API key and LLM config at Lines 167, 168, and 261 in GRPO-SIF/src/open-r1-multimodal/src/open_r1/vlm_modules/qwen_module.py.
step2: We use SIF-50K-sampled-200.json for training. Please place the dataset under the data/ folder beforehand.
step3: Run the training script.
bash run_scripts/train_grpo_sif.sh
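As for the progressive learning referenced above, one hypothetical reading is a schedule that gradually shifts weight toward the grounding reward; this is purely a sketch, not the logic in grpo_trainer.py:
def progressive_weight(step: int, total_steps: int,
                       w_start: float = 0.2, w_end: float = 1.0) -> float:
    # Linearly ramp the grounding-reward weight over training (hypothetical).
    t = min(step / max(total_steps, 1), 1.0)
    return w_start + (w_end - w_start) * t

# e.g. total reward = format_reward + progressive_weight(step, T) * iou_reward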
Merge the weights
Remember to merge the weights after each training phase, using the export config under scripts:
llamafactory-cli export merge.yaml
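For reference, a minimal LLaMA-Factory export config has this shape; the model, adapter, and output paths below are placeholders, not the repo's actual merge.yaml:
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct  # base model (placeholder)
adapter_name_or_path: saves/sifthinker/lora/sft  # adapter from the last phase (placeholder)
template: qwen2_vl
finetuning_type: lora
export_dir: output/sifthinker-merged
export_size: 5
export_device: cpu
export_legacy_format: false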
Inference
You can choose either vLLM or Hugging Face for inference.
API_PORT=8020 llamafactory-cli api inference.yaml
Then, you can use the script scripts/infer.py to run inference.
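Since llamafactory-cli api serves an OpenAI-compatible endpoint, a minimal client looks roughly like this (the model name, image URL, and prompt are placeholders; the real logic is in scripts/infer.py):
from openai import OpenAI

# Point the client at the local LLaMA-Factory API server started above.
client = OpenAI(base_url="http://localhost:8020/v1", api_key="dummy")

response = client.chat.completions.create(
    model="sifthinker",  # placeholder; the server routes to its loaded model
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/kitchen.jpg"}},
            {"type": "text",
             "text": "Which mug is closer to the camera?"},
        ],
    }],
)
print(response.choices[0].message.content)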
Evaluation
We follow VisCoT, SpatialBot, SAT, V*, CV-Bench, etc. to evaluate the results. Some modifications of the scripts are in the scripts/evaluation/ folder. (We use the vLLM server on port 8020 for inference.)
Acknowledgement
The repo also benefits from VLM-R1, Open-R1-Multimodal, Visual-CoT, LLaVA, SpatialBot, SAT, V*, OVD-Eval, trl, Cambrian. Thanks for their wonderful works.
Bibtex
If you find SIFThinker helpful for your work, please cite:
@article{chen2025sifthinker,
title={SIFThinker: Spatially-Aware Image Focus for Visual Reasoning},
author={Chen, Zhangquan and Zhao, Ruihui and Luo, Chuwei and Sun, Mingze and Yu, Xinlei and Kang, Yangyang and Huang, Ruqi},
journal={arXiv preprint arXiv:2508.06259},
year={2025}
}