ASID-Caption: Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions
Yunheng Li1 · Hengrui Zhang1 · Meng-Hao Guo3 · Wenzhao Gao2 · Shaoyong Jia2 · Shaohui Jiao2 · Qibin Hou1† · Ming-Ming Cheng1
1 VCIP, Nankai University · 2 ByteDance Inc. · 3 Tsinghua University
†Corresponding author
✨ Overview
Existing video instruction datasets often pair each video with a single unstructured caption, which yields incomplete descriptions and makes controllable, fine-grained understanding hard to learn. Simply making captions longer can introduce more hallucinations when there is no systematic verification.
Our key idea is to provide attribute-structured supervision and verify each attribute against audiovisual evidence, enabling more reliable fine-grained learning.
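To make "attribute-structured supervision" concrete, here is a minimal sketch of what a per-attribute annotation record could look like. The field names (`attribute`, `description`, `verified`) and values are illustrative assumptions, not the dataset's actual schema:

```python
import json

# Hypothetical attribute-structured record for one video.
# Field names are illustrative; the real ASID-Caption schema may differ.
record = {
    "video": "example.mp4",
    "attributes": [
        {"attribute": "subject", "description": "a chef dicing onions", "verified": True},
        {"attribute": "audio", "description": "knife tapping a wooden board", "verified": True},
        {"attribute": "camera", "description": "static close-up shot", "verified": False},
    ],
}

# Keep only attributes that passed audiovisual verification.
verified = [a for a in record["attributes"] if a["verified"]]
print(json.dumps([a["attribute"] for a in verified]))
```

Filtering on a per-attribute verification flag is what lets training discard unverified claims instead of throwing away the whole caption.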
🎬 Captioning Case of ASID-Caption
🚀 Getting Started
1. Clone the repository
First, clone the project and navigate into the directory:

git clone https://github.com/HVision-NKU/ASID-Caption.git
cd ASID-Caption
2. Set Up the Environment
Requires Python 3.11 (assumed to be installed already).
2.1 Inference
Single Video Inference
Batch Video Inference
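The inference scripts themselves are not reproduced here; as a rough sketch, batch inference amounts to collecting video files and captioning each in turn. `caption_video` below is a hypothetical placeholder for the model call, not an actual API of this repository:

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mkv", ".mov", ".webm"}

def collect_videos(root: str) -> list:
    """Recursively gather video files under `root`, sorted for reproducibility."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in VIDEO_EXTS)

def caption_video(path: Path) -> str:
    """Placeholder for the actual model call (hypothetical, not the repo's API)."""
    return f"caption for {path.name}"

def run_batch(root: str) -> dict:
    """Map each video path to its generated caption."""
    return {str(p): caption_video(p) for p in collect_videos(root)}
```

Sorting the file list keeps batch runs deterministic, which makes result files easy to diff across model versions.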
2.2 Training
Stage 1-2
Stage 3
📈 Benchmark Evaluation
Audiovisual Caption
video-SALMONN2-testset
UGC-VideoCap
QA-based Audiovisual Caption
Daily-Omni
WorldSense
Visual-only Caption
VDC
VidCapBench-AE
Caption-based Temporal Grounding
Charades-STA
Attribute-based Instruction Following
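Charades-STA grounding is conventionally scored with temporal IoU between the predicted and ground-truth segments, typically reported as Recall@1 at IoU thresholds such as 0.3/0.5/0.7. A minimal sketch of the metric (the function names are ours, not this repo's evaluation code):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, thresh=0.5):
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```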
🔥 Results
We provide detailed quantitative results on different benchmarks and settings as shown below.
✒️ Citation
If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

@article{li2026towards,
  title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
  author={Li, Yunheng and Zhang, Hengrui and Guo, Meng-Hao and Gao, Wenzhao and Jia, Shaoyong and Jiao, Shaohui and Hou, Qibin and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2602.13013},
  year={2026}
}