import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel

# Load the processor and the model (bfloat16, FlashAttention-2, on a CUDA device)
model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Preprocess an image and run the encoder with the token merger
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
Data Preparation
We use the WebDataset format: training data is stored as .tar archives (shards). Each shard stores a sequence of samples (e.g. image + text or other modalities) as separate files inside the tar, which allows efficient sequential I/O and scales well for distributed training.
For details on how we build these shards—including downloading with img2dataset, consolidating metadata to .jsonl, and packing into WebDataset .tar files—see the data_pack folder and its README.
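The shard layout described above can be sketched with the standard library alone: all files belonging to one sample share a basename key inside the tar. This is a minimal illustration of the format, not the project's actual packing code; the filenames and metadata fields are assumptions.

```python
import io
import json
import tarfile

def pack_shard(shard_path, samples):
    """Pack (key, image_bytes, metadata) samples into one WebDataset-style tar shard."""
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, metadata in samples:
            # Files of one sample share the same basename key,
            # e.g. 000001.jpg and 000001.json (names here are illustrative).
            for suffix, payload in ((".jpg", image_bytes),
                                    (".json", json.dumps(metadata).encode("utf-8"))):
                info = tarfile.TarInfo(name=f"{key}{suffix}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def read_shard(shard_path):
    """Read a shard back sequentially, grouping files by their shared sample key."""
    groups = {}
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            key, dot, suffix = member.name.partition(".")
            groups.setdefault(key, {})[dot + suffix] = tar.extractfile(member).read()
    return groups
```

Because samples are read in the order they were written, a data loader can stream a shard front to back without random seeks, which is what makes the format efficient for distributed training.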
Training
Captioning warmup
sh scripts/train_multi_task_webloader_qwen2_warmup.sh exp/cap-only-warmup-qwen2.yaml
Multi-task collaborative training
sh scripts/train_multi_task_webloader_qwen2.sh exp/multi-task-post-train-qwen2.yaml
Evaluation
Segmentation & depth (linear probing)
We provide scripts and configs for linear probing on segmentation and depth. See the evaluation folder: the subfolders evaluation/segmentation and evaluation/monodepth contain the respective setup and run instructions.
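Linear probing in this sense trains only a linear head on top of frozen encoder features. A minimal PyTorch sketch of the idea follows; the feature dimension, output dimension, and optimizer settings are illustrative assumptions, not the repository's actual evaluation configs.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """A single linear head over frozen backbone features (per-token prediction)."""
    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, feat_dim) from the frozen encoder.
        # detach() ensures no gradient flows back into the backbone.
        return self.head(feats.detach())

# Only the head's parameters are optimized; the encoder stays fixed.
probe = LinearProbe(feat_dim=1024, out_dim=150)  # e.g. 150 semantic classes (assumed)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
```

For depth, the same head would predict a single scalar per token (out_dim=1) instead of class logits; the probing machinery is otherwise identical.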
VQA with LLM
Our VQA training (connecting the vision encoder to an LLM) uses an internal company framework, so we are unable to open-source that part of the code. We do release the trained model weights for this setup; you can find them here.
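Connecting a vision encoder to an LLM commonly means projecting the encoder's output tokens into the LLM's embedding space before concatenating them with text embeddings. A generic sketch of such a connector follows; the two-layer MLP and the dimensions are assumptions for illustration, not the internal framework's actual design.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map vision tokens (B, N, vision_dim) to LLM input embeddings (B, N, llm_dim)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # A small MLP is a common choice for vision-language connectors.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)
```

The projected tokens are then treated by the LLM exactly like text token embeddings, which is what lets the released encoder weights be reused with different language models.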
🫡 Acknowledgements
Many thanks to the code bases from InternVL and FiT3D.
Citation
If you use this code for your research or project, please cite:
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}