News
🏅 SaSaSa2VA wins 1st place in the ICCV 2025 LSVOS Challenge RVOS Track! 🎉🎉🎉
Open-source progress
Release Qwen3-VL related models.
Release InternVL3 related models.
Release Qwen2.5-VL related models.
Release open-sourced training datasets.
Release Ref-SAV dataset.
Release evaluation code for each dataset.
Release 1B, 4B, 8B, and 26B models.
Release training code for the 1B, 4B, and 8B models.
Release inference and test code.
Release demo code.
Overview
This repository contains the code for the paper “Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos”.
Sa2VA is the first unified model for the dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space.
🤗 Gradio Demos
We provide a script that implements interactive chat using Gradio (this requires installing gradio). You can use it to quickly build a local chat interface.
Environment
Use uv to manage dependencies. Run uv sync to install everything, choosing the extra based on your model family:
uv sync --extra=legacy for InternVL2.5 or earlier models (legacy Transformers).
uv sync --extra=latest for newer models (latest Transformers).
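Because the two extras resolve different Transformers releases, it can help to confirm which version the environment actually installed before loading a model. A small stdlib-only check (a hypothetical helper, not part of this repo) might look like:

```python
import importlib.metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None
```

For example, `print(installed_version("transformers"))` shows which Transformers release your synced environment resolved.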
🚀 Quick Start
Our Sa2VA models are available on 🤗HuggingFace. With only a few steps, you can try them on your own data. You can install just demo/requirements.txt to avoid the training-only packages.
Option 1 - Scripts:
Suppose you have a folder (PATH_TO_FOLDER) containing the frames of a video. You can use the following script to chat with the Sa2VA model or segment objects in the video.
python demo/demo.py PATH_TO_FOLDER --model_path ByteDance/Sa2VA-8B --work-dir OUTPUT_DIR --text "<image>Please describe the video content."
If the output contains the segmentation results, the results will be saved to OUTPUT_DIR.
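The demo expects PATH_TO_FOLDER to contain the video's frames as individual image files, processed in filename order. A minimal sketch of that frame-collection logic (a hypothetical helper for illustration, not the repo's actual code):

```python
import os


def load_frame_paths(folder, exts=(".jpg", ".jpeg", ".png")):
    """Collect image files from a folder, sorted by filename so the
    frames are processed in temporal order."""
    frames = [f for f in os.listdir(folder) if f.lower().endswith(exts)]
    return [os.path.join(folder, f) for f in sorted(frames)]
```

Zero-padded frame names (e.g. 00001.jpg, 00002.jpg, ...) ensure the lexicographic sort matches the temporal order.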
Option 2 - Jupyter Notebook:
Please refer to demo.ipynb.
🎥 Demo
Demo 1
Input Video (Source: La La Land 2016):
Instruction: “Please segment the girl wearing the yellow dress.”
Demo 2
Input Video (Source: La La Land 2016):
Instruction: “Please segment the main character.”
Demo 3
Input Video (Source: Internet):
Instruction: “Please segment the person wearing sunglasses.”
Demo 4
Input Video (Source: Internet):
Instruction: “Please segment the singing girl.”
Demo 5
Input Video:
Instruction: “What is the atmosphere of the scene?”
Answer: “The scene has a dark and mysterious atmosphere, with the men dressed in suits and ties, and the dimly lit room.”
Training
Installation
We provide two installation options. Using uv is recommended for a faster and more reliable setup.
Option 1: Using uv (Recommended)
First, install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then, create a virtual environment and sync the dependencies:
uv sync --extra=latest # or uv sync --extra=legacy for Sa2VA based on InternVL2/2.5
source .venv/bin/activate
Option 2: Using conda and pip
Deprecated.
Pretrained Model Preparation
You are expected to download the following pretrained models and place them in the ./pretrained directory. You can download the remaining models from the InternVL2.5 Hugging Face collections.
Data Preparation
Please download the training datasets and place them in the data directory. The download link is here.
Please put the zip files directly into the data directory and unzip them there. For example, you can download video_datas_mevis.zip and unzip it in the data directory.
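If you prefer to script the extraction, a minimal standard-library equivalent is sketched below (the archive name follows the example above; adjust paths to your setup):

```python
import pathlib
import zipfile


def extract_dataset(zip_path, data_dir="data"):
    """Unzip one dataset archive into the data directory."""
    target = pathlib.Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    return target
```

For example, `extract_dataset("video_datas_mevis.zip")` unpacks the MeViS videos under data/.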
Important: sam_v_full is the SA-V dataset, which is not included in the download link. You can download it from Meta (here). Please follow their license.
Training Script
Please run the following script to train with 8 GPUs; we suggest using at least 8 A100 GPUs:
Fine-tuning
We provide a simple example for fine-tuning Sa2VA on an image referring segmentation task. For detailed instructions, please refer to our fine-tuning guide.
The example dataset is constructed from a few images from RefCOCO. To fine-tune on your own data, you can organize it in the same format as our example annotations.json. You can download the example dataset from Hugging Face.
For other types of data, you may need to customize the dataloader and configuration. Please refer to projects/sa2va/datasets/sa2va_data_finetune.py and projects/sa2va/configs/sa2va_finetune.py for guidance.
Convert Trained Model to Hugging Face Format
Please run the following script to convert:
Evaluation
You can download the Ref-SAV eval set here 🤗.
Image/Video Referring Segmentation Evaluation
Please use the following script to test Sa2VA on video object segmentation benchmarks using 8 GPUs.
You can use the following command to evaluate Sa2VA on all segmentation benchmarks at once:
Or you can evaluate Sa2VA on a single segmentation benchmark (such as ReVOS):
Image/Video QA Evaluation
We use sa2va_eval (a modified version of VLMEvalKit) for image/video chat benchmark evaluation.
Single-GPU Evaluation Example:
Multi-GPU Evaluation Example:
Model Zoo
We provide the following models:
| Model Name | Base MLLM | Language Part | HF Link |
|:---:|:---:|:---:|:---:|
| Sa2VA-1B | InternVL2.5-1B | Qwen2.5-0.5B-Instruct | 🤗 link |
| Sa2VA-4B | InternVL2.5-4B | Qwen2.5-3B-Instruct | 🤗 link |
| Sa2VA-8B | InternVL2.5-8B | internlm2_5-7b-chat | 🤗 link |
| Sa2VA-26B | InternVL2.5-26B | internlm2_5-20b-chat | 🤗 link |
| Sa2VA-InternVL3-2B | InternVL3-2B | Qwen2.5-1.5B | 🤗 link |
| Sa2VA-InternVL3-8B | InternVL3-8B | Qwen2.5-7B | 🤗 link |
| Sa2VA-InternVL3-14B | InternVL3-14B | Qwen2.5-14B | 🤗 link |
| Sa2VA-Qwen2_5-VL-3B | Qwen2.5-VL-3B-Instruct | Qwen2.5-3B | 🤗 link |
| Sa2VA-Qwen2_5-VL-7B | Qwen2.5-VL-7B-Instruct | Qwen2.5-7B | 🤗 link |
| Sa2VA-Qwen3-VL-2B | Qwen3-VL-2B-Instruct | Qwen3-1.7B | 🤗 link |
| Sa2VA-Qwen3-VL-4B | Qwen3-VL-4B-Instruct | Qwen3-4B | 🤗 link |
References
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
[🏠 Sa2VA] [📜 arXiv] [🤗 HuggingFace] [Gradio Demo (Ours internal: Sa2VA-4B)] [Gradio Demo (By HuggingFace Official)] [🤖 Replicate Demo]
Haobo Yuan1* · Xiangtai Li2*† · Tao Zhang2,3* · Yueyi Sun4 · Zilong Huang2 · Shilin Xu4 · Shunping Ji3 · Yunhai Tong4 · Lu Qi3 · Jiashi Feng2 · Ming-Hsuan Yang1
1UC Merced  2ByteDance Seed  3WHU  4PKU
† project lead  * the first three authors contributed equally to this work.
If you find this repository useful, please consider citing the following paper:
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Sun, Yueyi and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv pre-print},
  year={2025}
}