```shell
# suppose we have 8 GPUs on a machine
# evaluate Mistral 7B v0.3
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/mistral_eval.sh
# evaluate Gemma 2 9B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/gemma_eval.sh
```
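Both scripts are launched with all eight GPUs visible. On a machine with fewer free GPUs, a shorter `CUDA_VISIBLE_DEVICES` list restricts which devices the script can see; this is a standard CUDA convention, and the 4-GPU subset below is only an example:

```shell
# Run an eval on a subset of GPUs. The 4-GPU subset is an example;
# the eval script itself is the one named in this README.
export CUDA_VISIBLE_DEVICES=0,1,2,3   # the process re-indexes these as devices 0..3
echo "visible GPUs: $CUDA_VISIBLE_DEVICES"
# bash run/mistral_eval.sh            # uncomment to actually launch the eval
```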
Citation
If you find D-Attn useful for your research and applications, please cite our works:
```bibtex
@article{kuo2025rethinking,
  title={D-Attn: Decomposed Attention for Large Vision-and-Language Models},
  author={Kuo, Chia-Wen and Zhu, Sijie and Chen, Fan and Shen, Xiaohui and Wen, Longyin},
  journal={arXiv preprint arXiv:2502.01906},
  year={2025}
}

@article{team2025vidi,
  title={Vidi: Large Multimodal Models for Video Understanding and Editing},
  author={{Vidi Team} and Liu, Celong and Kuo, Chia-Wen and Du, Dawei and Chen, Fan and Chen, Guang and Yuan, Jiamin and Zhang, Lingxi and Guo, Lu and Li, Lusha and others},
  journal={arXiv preprint arXiv:2504.15681},
  year={2025}
}
```
D-Attn: Decomposed Attention for Large Vision-and-Language Models
A large vision-and-language model with linear computational complexity for the vision modality and stronger VL capability.
D-Attn: Decomposed Attention for Large Vision-and-Language Models [Paper]
Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen
Vidi: Large Multimodal Models for Video Understanding and Editing [Webpage] [Paper] [Code]
Intelligent Editing Team, ByteDance Inc.
Contents
Install
Clone this repository and navigate to the dattn folder
Install packages
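The two install steps above can be sketched as below. The repository URL and the package-install command are assumptions (this section does not spell them out), so substitute the actual values from the repo page:

```shell
# Hedged sketch of the install flow. The URL and install command below are
# assumptions, not taken from this README.
# git clone https://github.com/<org>/dattn.git   # step 1: clone (hypothetical URL)
# cd dattn                                       #         and enter the dattn folder
# pip install -e .                               # step 2: install packages
git --version    # sanity-check that git is available before cloning
```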
Model Weights
Coming soon.
Training
Data preparation
Download json annotation files here, including blip_laion_cc_sbu_558k.json for the alignment, shrcap_filtered.json for pre-training, and llava_gpt4v_filtered.json for sft.
Download images following ShareGPT4V, including LAION-CC-SBU-558K, COCO, WebData, SAM, GQA, OCR-VQA, TextVQA, and VisualGenome.
Organize downloaded data as follows:
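The directory tree that "as follows" refers to is not reproduced in this section. As an illustrative placeholder only, the skeleton below follows the ShareGPT4V/LLaVA naming convention; every path name is an assumption and should be replaced with the tree from the repository:

```shell
# Illustrative skeleton only -- all directory names below are assumptions
# based on the ShareGPT4V convention, not the actual tree from this README.
ROOT=playground/data
mkdir -p "$ROOT/llava/blip_laion_cc_sbu_558k" \
         "$ROOT/sharegpt4v/coco" \
         "$ROOT/sharegpt4v/sam" \
         "$ROOT/sharegpt4v/gqa"
find "$ROOT" -type d | sort
```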
Mistral 7B v0.3
Gemma 2 9B
Evaluation
Data preparation
Follow LLaVA to download
ScienceQA, MME, GQA, POPE, TextVQA, SEED-Bench, LLaVA-Bench-in-the-Wild, MM-Vet, VQAv2, MMBench, and VisWiz.
Follow MMStar to download the MMStar benchmark.
Organize downloaded benchmarks as follows:
Evaluate all