Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang
📝 Introduction
Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
🧳 Framework
Model architecture of Citrus-V. The framework consists of three components: (1) an MLLM—including the LLM, tokenizer, and a vision encoder—for high-level visual-textual reasoning such as report generation, VQA, and grounding; (2) a segmentation projector that maps the "[SEG]" token produced by the MLLM into latent segmentation prompts; and (3) a segmentation model that decodes the latent segmentation prompts together with semantic image features into pixel-level masks. Separate image encoders are employed to decouple low-level details for segmentation from high-level semantics for other tasks, ensuring both types of tasks are optimized without semantic conflict.
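As a rough illustration only (not the actual implementation; every dimension, weight, and function name below is invented), the [SEG]-token pathway can be sketched as: the MLLM emits a hidden state for the "[SEG]" token, a projector maps it to a latent segmentation prompt, and a decoder scores that prompt against low-level pixel features from the separate segmentation image encoder to produce a mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy dimensions, for illustration only.
HIDDEN_DIM = 8   # MLLM hidden size
PROMPT_DIM = 4   # latent segmentation prompt size
H, W = 16, 16    # mask resolution

def seg_projector(seg_hidden, weight, bias):
    """Map the MLLM's [SEG]-token hidden state to a latent segmentation prompt."""
    return weight @ seg_hidden + bias

def decode_mask(prompt, pixel_features):
    """Score each pixel's low-level feature against the prompt and threshold
    the scores into a binary mask (a toy stand-in for the SAM2-style decoder)."""
    scores = pixel_features @ prompt          # (H*W,)
    return (scores > 0).reshape(H, W)

# Stand-ins for the MLLM output and the segmentation encoder's image features.
seg_hidden = rng.normal(size=HIDDEN_DIM)
pixel_features = rng.normal(size=(H * W, PROMPT_DIM))

weight = rng.normal(size=(PROMPT_DIM, HIDDEN_DIM))
bias = np.zeros(PROMPT_DIM)

prompt = seg_projector(seg_hidden, weight, bias)
mask = decode_mask(prompt, pixel_features)
print(mask.shape, mask.dtype)  # (16, 16) bool
```

The point of the two-encoder design described above is visible even in this toy: `pixel_features` come from an encoder tuned for low-level detail, while `seg_hidden` carries the high-level intent, and the projector is the only bridge between them.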
🚧 Open-Source Progress
🛠️ Installation
To install Citrus-V:
1. Create the base environment.
2. Install the requirements.
3. Install flash-attention according to your environment; here we used `flash-attn==2.7.3`.
4. Install the Citrus-V training environment (based on ms-swift).
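A typical sequence for the steps above might look like the following; the environment name and Python version are assumptions, so consult the repo's own scripts for the authoritative commands:

```shell
# Illustrative only; environment name and Python version are assumptions.
conda create -n citrus-v python=3.10 -y
conda activate citrus-v

# Install the project requirements from the repo root.
pip install -r requirements.txt

# Match flash-attention to your CUDA/PyTorch build; this is the version noted above.
pip install flash-attn==2.7.3

# ms-swift provides the training framework Citrus-V builds on.
pip install ms-swift
```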
🎒 Prepare Model Checkpoints
Make sure you have git-lfs installed, then download all of the following checkpoints to `projects/pretrained_weights`.

Download the Citrus-V checkpoints:
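As a sketch of the download step, with `<org>/<model-id>` standing in for the actual checkpoint repositories listed on the project page:

```shell
# Illustrative only: replace <org>/<model-id> with the repos from the project page.
git lfs install
mkdir -p projects/pretrained_weights
git clone https://huggingface.co/<org>/<model-id> projects/pretrained_weights/<model-id>
```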
📚 Prepare Your Custom Data
We recommend using the official ms-swift documentation to prepare your custom training dataset.
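For reference, ms-swift custom multimodal datasets are typically JSONL files in which each record holds a `messages` list plus image paths; the snippet below shows one such record (the image path and texts are made up, and the exact fields should be checked against the ms-swift documentation):

```python
import json

# One training record in the JSONL layout ms-swift commonly uses for
# multimodal data; the image path and texts here are invented examples.
record = {
    "messages": [
        {"role": "user", "content": "<image>Describe the finding in this scan."},
        {"role": "assistant", "content": "There is a well-defined lesion in the left lobe."},
    ],
    "images": ["data/images/case_0001.png"],
}

# Each record is serialized as one line of the .jsonl training file.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(parsed["messages"][0]["role"])  # user
```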
⚓️ Training
Citrus-V is trained in four stages:
1. Concept alignment, for a stable vision–language mapping;
2. Comprehension enhancement, for stronger multimodal reasoning;
3. Instruction fine-tuning, to strengthen instruction-following while encoding segmentation intent;
4. Segmentation fine-tuning, to adapt SAM2 for precise medical image segmentation.
Training stages 1 & 2
It is recommended to start training from stage 3 using the pretrained Citrus-V model.
To train the Citrus-V model from scratch, first build the original model using the following scripts:
Training stage 3
View Complete Training Command
Training stage 4
View Complete Training Command
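The complete stage-specific commands live in the repo's training scripts; purely as a sketch, an ms-swift-style fine-tuning invocation generally has this shape (the model path, dataset file, and hyperparameters below are placeholders, not the project's actual settings):

```shell
# Placeholder invocation; see the repo's stage scripts for the real flags.
swift sft \
    --model projects/pretrained_weights/<citrus-v-checkpoint> \
    --dataset data/train.jsonl \
    --num_train_epochs 1 \
    --output_dir output/stage4
```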
🚀 Deploy & Inference
🏛 License
This project is licensed under the Apache License (Version 2.0). For models and datasets, please refer to their original resource pages and follow the corresponding licenses.
📎 Citation
If you use Citrus-V in your research, please cite our work:
@misc{wang2025citrusvadvancingmedicalfoundation,
      title={Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning},
      author={Guoxin Wang and Jun Zhao and Xinyi Liu and Yanbo Liu and Xuyang Cao and Chao Li and Zhuoyun Liu and Qintian Sun and Fangru Zhou and Haoqiang Xing and Zhenhong Yang},
      year={2025},
      eprint={2509.19090},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.19090},
}
🤝 Acknowledgments
We would like to thank the contributors to the ms-swift, SA2VA, SAM2, Qwen2.5-VL, and mmdetection repositories for their open research and extraordinary work.