Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning
Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang
📝 Introduction
Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
🧳 Framework
Model architecture of Citrus-V. The framework consists of three components: (1) an MLLM—including the LLM, tokenizer, and a vision encoder—for high-level visual-textual reasoning such as report generation, VQA, and grounding; (2) a segmentation projector that maps the "[SEG]" token produced by the MLLM into latent segmentation prompts; and (3) a segmentation model that decodes the latent segmentation prompts together with semantic image features into pixel-level masks. Separate image encoders are employed to decouple low-level details for segmentation from high-level semantics for other tasks, ensuring both types of tasks are optimized without semantic conflict.
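As a rough illustration only (not the actual implementation; every dimension, weight, and function name below is invented), the [SEG]-token pathway can be sketched as: the MLLM emits a hidden state for the "[SEG]" token, a projector maps it to a latent segmentation prompt, and a decoder scores that prompt against low-level pixel features from the separate segmentation image encoder to produce a mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy dimensions, for illustration only.
HIDDEN_DIM = 8   # MLLM hidden size
PROMPT_DIM = 4   # latent segmentation prompt size
H, W = 16, 16    # mask resolution

def seg_projector(seg_hidden, weight, bias):
    """Map the MLLM's [SEG]-token hidden state to a latent segmentation prompt."""
    return weight @ seg_hidden + bias

def decode_mask(prompt, pixel_features):
    """Score each pixel's low-level feature against the prompt and threshold
    the scores into a binary mask (a toy stand-in for the SAM2-style decoder)."""
    scores = pixel_features @ prompt          # (H*W,)
    return (scores > 0).reshape(H, W)

# Stand-ins for the MLLM output and the segmentation encoder's image features.
seg_hidden = rng.normal(size=HIDDEN_DIM)
pixel_features = rng.normal(size=(H * W, PROMPT_DIM))

weight = rng.normal(size=(PROMPT_DIM, HIDDEN_DIM))
bias = np.zeros(PROMPT_DIM)

prompt = seg_projector(seg_hidden, weight, bias)
mask = decode_mask(prompt, pixel_features)
print(mask.shape, mask.dtype)  # (16, 16) bool
```

The point of the two-encoder design described above is visible even in this toy: `pixel_features` come from an encoder tuned for low-level detail, while `seg_hidden` carries the high-level intent, and the projector is the only bridge between them.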
🚧 Open-Source Progress
🛠️ Installation
To install Citrus-V:
1. Create the base environment.
2. Install the requirements.
3. Install flash-attention according to your environment; here we used `flash-attn==2.7.3`.
4. Install the Citrus-V training environment (based on ms-swift).
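A typical sequence for the steps above might look like the following; the environment name and Python version are assumptions, so consult the repo's own scripts for the authoritative commands:

```shell
# Illustrative only; environment name and Python version are assumptions.
conda create -n citrus-v python=3.10 -y
conda activate citrus-v

# Install the project requirements from the repo root.
pip install -r requirements.txt

# Match flash-attention to your CUDA/PyTorch build; this is the version noted above.
pip install flash-attn==2.7.3

# ms-swift provides the training framework Citrus-V builds on.
pip install ms-swift
```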
🎒 Prepare Model Checkpoints
Make sure you have git-lfs installed, then download all of the following checkpoints to `projects/pretrained_weights`.

Download the Citrus-V checkpoints:
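As a sketch of the download step, with `<org>/<model-id>` standing in for the actual checkpoint repositories listed on the project page:

```shell
# Illustrative only: replace <org>/<model-id> with the repos from the project page.
git lfs install
mkdir -p projects/pretrained_weights
git clone https://huggingface.co/<org>/<model-id> projects/pretrained_weights/<model-id>
```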
📚 Prepare Your Custom Data
We recommend using the official ms-swift documentation to prepare your custom training dataset.
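For reference, ms-swift custom multimodal datasets are typically JSONL files in which each record holds a `messages` list plus image paths; the snippet below shows one such record (the image path and texts are made up, and the exact fields should be checked against the ms-swift documentation):

```python
import json

# One training record in the JSONL layout ms-swift commonly uses for
# multimodal data; the image path and texts here are invented examples.
record = {
    "messages": [
        {"role": "user", "content": "<image>Describe the finding in this scan."},
        {"role": "assistant", "content": "There is a well-defined lesion in the left lobe."},
    ],
    "images": ["data/images/case_0001.png"],
}

# Each record is serialized as one line of the .jsonl training file.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(parsed["messages"][0]["role"])  # user
```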
⚓️ Training
Citrus-V is trained in four stages:
1. Concept alignment, for a stable vision–language mapping;
2. Comprehension enhancement, for stronger multimodal reasoning;
3. Instruction fine-tuning, to strengthen instruction-following while encoding segmentation intent;
4. Segmentation fine-tuning, to adapt SAM2 for precise medical image segmentation.
Training stages 1 & 2
It is recommended to start training from stage 3 using the pretrained Citrus-V model.
To train the Citrus-V model from scratch, first build the original model using the following scripts:
Training stage 3
View Complete Training Command
Training stage 4
View Complete Training Command
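The complete stage-specific commands live in the repo's training scripts; purely as a sketch, an ms-swift-style fine-tuning invocation generally has this shape (the model path, dataset file, and hyperparameters below are placeholders, not the project's actual settings):

```shell
# Placeholder invocation; see the repo's stage scripts for the real flags.
swift sft \
    --model projects/pretrained_weights/<citrus-v-checkpoint> \
    --dataset data/train.jsonl \
    --num_train_epochs 1 \
    --output_dir output/stage4
```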
🚀 Deploy & Inference
🏛 License
This project is licensed under the Apache License (Version 2.0). For models and datasets, please refer to their original resource pages and follow the corresponding licenses.
📎 Citation
If you use Citrus-V in your research, please cite our work:
@misc{wang2025citrusvadvancingmedicalfoundation,
      title={Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning},
      author={Guoxin Wang and Jun Zhao and Xinyi Liu and Yanbo Liu and Xuyang Cao and Chao Li and Zhuoyun Liu and Qintian Sun and Fangru Zhou and Haoqiang Xing and Zhenhong Yang},
      year={2025},
      eprint={2509.19090},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.19090},
}
🤝 Acknowledgments
We would like to thank the contributors to the ms-swift, SA2VA, SAM2, Qwen2.5-VL, and mmdetection repositories for their open research and extraordinary work.