$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models

A large vision-and-language model with linear computational complexity for the vision modality and stronger VL capability.

$\mathcal{D}$-Attn: Decomposed Attention for Large Vision-and-Language Models [Paper]
Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen

Vidi: Large Multimodal Models for Video Understanding and Editing [Webpage] [Paper] [Code]
Intelligent Editing Team, ByteDance Inc.

Contents

- Install
- Model Weights
- Training
- Evaluation
- Citation

Install

  1. Clone this repository and navigate to the DecomposedAttention folder

    git clone https://github.com/bytedance/DecomposedAttention
    cd DecomposedAttention
  2. Install packages (a quick environment check is sketched after these steps)

    conda create -n dattn python=3.11 -y
    conda activate dattn
    bash run/install.sh
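
After installation, a minimal environment check along the following lines can confirm that the environment is usable (this is only a sketch and assumes run/install.sh provisions PyTorch with CUDA support; adjust as needed):

# sanity check: the dattn environment should expose a CUDA-enabled PyTorch
# (assumption: run/install.sh installs PyTorch; skip if your setup differs)
conda activate dattn
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"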

Model Weights

Coming soon.

Training

Data preparation

Download the JSON annotation files here, including blip_laion_cc_sbu_558k.json for the alignment stage, shrcap_filtered.json for the pre-training stage, and llava_gpt4v_filtered.json for the instruction-tuning (SFT) stage.

Download images following ShareGPT4V, including LAION-CC-SBU-558K, COCO, WebData, SAM, GQA, OCR-VQA, TextVQA, and VisualGenome.

Organize the downloaded data as follows (a quick sanity-check snippet is sketched after the tree):

DecomposedAttention
├── ...
├── data
│   ├── blip_laion_cc_sbu_558k.json
│   ├── shrcap_filtered.json
│   ├── llava_gpt4v_filtered.json
│   ├── train
│   │   ├── llava
│   │   │   ├── llava_pretrain
│   │   │   │   ├── images
│   │   ├── coco
│   │   │   ├── train2017
│   │   ├── sam
│   │   │   ├── images
│   │   ├── gqa
│   │   │   ├── images
│   │   ├── ocr_vqa
│   │   │   ├── images
│   │   ├── textvqa
│   │   │   ├── train_images
│   │   ├── vg
│   │   │   ├── VG_100K
│   │   │   ├── VG_100K_2
│   │   ├── share_textvqa
│   │   │   ├── images
│   │   ├── web-celebrity
│   │   │   ├── images
│   │   ├── web-landmark
│   │   │   ├── images
│   │   ├── wikiart
│   │   │   ├── images
├── ...
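
Once everything is downloaded, a quick check like the sketch below can confirm the layout matches the tree above (the paths are copied directly from the tree; extend or trim the lists if your data lives elsewhere):

# print any annotation file or image folder from the tree above that is missing
for f in data/blip_laion_cc_sbu_558k.json data/shrcap_filtered.json data/llava_gpt4v_filtered.json; do
  [ -f "$f" ] || echo "MISSING file: $f"
done
for d in llava/llava_pretrain/images coco/train2017 sam/images gqa/images ocr_vqa/images \
         textvqa/train_images vg/VG_100K vg/VG_100K_2 share_textvqa/images \
         web-celebrity/images web-landmark/images wikiart/images; do
  [ -d "data/train/$d" ] || echo "MISSING folder: data/train/$d"
done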

Mistral 7B v0.3

# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints

# multimodal alignment stage
bash run/mistral_aln.sh

# multimodal pre-training stage
bash run/mistral_pt.sh

# instruction tuning stage
bash run/mistral_it.sh
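
The stages above are listed in the order they run (alignment, then pre-training, then instruction tuning), so one minimal way to run the whole Mistral pipeline unattended is to chain them and stop at the first failure (this sketch assumes each script exits with a non-zero status on error):

# run the three Mistral 7B v0.3 stages back to back; abort on the first failure
set -e
mkdir -p checkpoints
bash run/mistral_aln.sh   # multimodal alignment
bash run/mistral_pt.sh    # multimodal pre-training
bash run/mistral_it.sh    # instruction tuning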

Gemma 2 9B

# all training ckpts will be stored in the ./checkpoints folder
mkdir -p checkpoints

# multimodal alignment stage
bash run/gemma_aln.sh

# multimodal pre-training stage
bash run/gemma_pt.sh

# instruction tuning stage
bash run/gemma_it.sh

Evaluation

Data preparation

Follow LLaVA to download ScienceQA, MME, GQA, POPE, TextVQA, SEED-Bench, LLaVA-Bench-in-the-Wild, MM-Vet, VQAv2, MMBench, and VizWiz.

Follow MMStar to download the MMStar benchmark.

Organize the downloaded benchmarks as follows (a quick check is sketched after the tree):

DecomposedAttention
├── ...
├── data
│   ├── val
│   │   ├── scienceqa
│   │   ├── MME
│   │   ├── gqa
│   │   ├── pope
│   │   ├── textvqa
│   │   ├── seed_bench
│   │   ├── llava-bench-in-the-wild
│   │   ├── mm-vet
│   │   ├── vqav2
│   │   ├── mmbench
│   │   ├── vizwiz
├── ...
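
Before launching the evaluation scripts, a loop like the following can verify that every benchmark folder from the tree above is in place (folder names are taken from the tree; MMStar is omitted here, so add it if you also store it under data/val):

# print any benchmark folder from the tree above that is missing under data/val
for d in scienceqa MME gqa pope textvqa seed_bench llava-bench-in-the-wild mm-vet vqav2 mmbench vizwiz; do
  [ -d "data/val/$d" ] || echo "MISSING: data/val/$d"
done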

Evaluate all

# suppose we have 8 GPUs on a machine

# evaluate Mistral 7B v0.3
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/mistral_eval.sh

# evaluate Gemma 2 9B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash run/gemma_eval.sh
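
The commands above assume an 8-GPU machine. Assuming the evaluation scripts simply distribute work across whatever devices are visible, a machine with fewer GPUs can list fewer device IDs, e.g.:

# e.g. evaluate Mistral 7B v0.3 on a 4-GPU machine
# (assumption: the eval scripts shard work over the visible devices)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run/mistral_eval.sh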

Citation

If you find $\mathcal{D}$-Attn useful for your research and applications, please cite our works:

@article{kuo2025rethinking,
  title={D-Attn: Decomposed Attention for Large Vision-and-Language Models},
  author={Kuo, Chia-Wen and Zhu, Sijie and Chen, Fan and Shen, Xiaohui and Wen, Longyin},
  journal={arXiv preprint arXiv:2502.01906},
  year={2025}
}

@article{team2025vidi,
  title={Vidi: Large Multimodal Models for Video Understanding and Editing},
  author={Vidi Team and Liu, Celong and Kuo, Chia-Wen and Du, Dawei and Chen, Fan and Chen, Guang and Yuan, Jiamin and Zhang, Lingxi and Guo, Lu and Li, Lusha and others},
  journal={arXiv preprint arXiv:2504.15681},
  year={2025}
}