import torch
from PIL import Image
from transformers import AutoImageProcessor
from models.versavit import VersaViTPretrainedModel

# Load the processor and the model (bfloat16, FlashAttention-2, on a CUDA device)
model_path = 'tencent/VersaViT'
processor = AutoImageProcessor.from_pretrained(model_path)
model = VersaViTPretrainedModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map='cuda',
)

# Preprocess an image and run the encoder with the token merger
image = Image.open("./assets/versavit_logo.png")
inputs = processor(images=image, return_tensors="pt").to('cuda')
outputs = model.forward_wt_merger(inputs['pixel_values'], inputs['image_grid_thw'])
Data Preparation
We use the WebDataset format: training data is stored as .tar archives (shards). Each shard stores a sequence of samples (e.g. image + text or other modalities) as separate files inside the tar, which allows efficient sequential I/O and scales well for distributed training.
For details on how we build these shards—including downloading with img2dataset, consolidating metadata to .jsonl, and packing into WebDataset .tar files—see the data_pack folder and its README.
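The shard layout described above can be sketched with the standard library alone: all files belonging to one sample share a basename key inside the tar. This is a minimal illustration of the format, not the project's actual packing code; the filenames and metadata fields are assumptions.

```python
import io
import json
import tarfile

def pack_shard(shard_path, samples):
    """Pack (key, image_bytes, metadata) samples into one WebDataset-style tar shard."""
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, metadata in samples:
            # Files of one sample share the same basename key,
            # e.g. 000001.jpg and 000001.json (names here are illustrative).
            for suffix, payload in ((".jpg", image_bytes),
                                    (".json", json.dumps(metadata).encode("utf-8"))):
                info = tarfile.TarInfo(name=f"{key}{suffix}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def read_shard(shard_path):
    """Read a shard back sequentially, grouping files by their shared sample key."""
    groups = {}
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            key, dot, suffix = member.name.partition(".")
            groups.setdefault(key, {})[dot + suffix] = tar.extractfile(member).read()
    return groups
```

Because samples are read in the order they were written, a data loader can stream a shard front to back without random seeks, which is what makes the format efficient for distributed training.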
Training
Captioning warmup
sh scripts/train_multi_task_webloader_qwen2_warmup.sh exp/cap-only-warmup-qwen2.yaml
Multi-task collaborative training
sh scripts/train_multi_task_webloader_qwen2.sh exp/multi-task-post-train-qwen2.yaml
Evaluation
Segmentation & depth (linear probing)
We provide scripts and configs for linear probing on segmentation and depth. See the evaluation folder: the subfolders evaluation/segmentation and evaluation/monodepth contain the respective setup and run instructions.
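Linear probing in this sense trains only a linear head on top of frozen encoder features. A minimal PyTorch sketch of the idea follows; the feature dimension, output dimension, and optimizer settings are illustrative assumptions, not the repository's actual evaluation configs.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """A single linear head over frozen backbone features (per-token prediction)."""
    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, feat_dim) from the frozen encoder.
        # detach() ensures no gradient flows back into the backbone.
        return self.head(feats.detach())

# Only the head's parameters are optimized; the encoder stays fixed.
probe = LinearProbe(feat_dim=1024, out_dim=150)  # e.g. 150 semantic classes (assumed)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
```

For depth, the same head would predict a single scalar per token (out_dim=1) instead of class logits; the probing machinery is otherwise identical.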
VQA with LLM
Our VQA training (connecting the vision encoder to an LLM) uses an internal company framework, so we are unable to open-source that part of the code. We do release the trained model weights for this setup; you can find them here.
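Connecting a vision encoder to an LLM commonly means projecting the encoder's output tokens into the LLM's embedding space before concatenating them with text embeddings. A generic sketch of such a connector follows; the two-layer MLP and the dimensions are assumptions for illustration, not the internal framework's actual design.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Map vision tokens (B, N, vision_dim) to LLM input embeddings (B, N, llm_dim)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # A small MLP is a common choice for vision-language connectors.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)
```

The projected tokens are then treated by the LLM exactly like text token embeddings, which is what lets the released encoder weights be reused with different language models.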
🫡 Acknowledgements
Many thanks to the code bases from InternVL and FiT3D.
Citation
If you use this code for your research or project, please cite:
@article{liu2026versavit,
  title={VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization},
  author={Liu, Yikun and Liu, Yuan and Di, Shangzhe and Wang, Haicheng and Zhao, Zhongyin and Tian, Le and Zhou, Xiao and Zhou, Jie and Yao, Jiangchao and Wang, Yanfeng and others},
  journal={arXiv preprint arXiv:2602.09934},
  year={2026}
}