ASID-Caption: Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions
Yunheng Li1 · Hengrui Zhang1 · Meng-Hao Guo3 · Wenzhao Gao2 · Shaoyong Jia2 · Shaohui Jiao2 · Qibin Hou1† · Ming-Ming Cheng1
1 VCIP, Nankai University · 2 ByteDance Inc. · 3 Tsinghua University
†Corresponding author
✨ Overview
Existing video instruction datasets often pair each video with a single unstructured caption, which yields incomplete descriptions and makes controllable, fine-grained understanding hard to learn. Simply making captions longer can introduce more hallucinations when there is no systematic verification.
Our key idea is to provide attribute-structured supervision and verify each attribute against audiovisual evidence, enabling more reliable fine-grained learning.
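To make "attribute-structured supervision" concrete, here is a minimal sketch of what a per-attribute annotation record could look like. The field names (`attribute`, `description`, `verified`) and values are illustrative assumptions, not the dataset's actual schema:

```python
import json

# Hypothetical attribute-structured record for one video.
# Field names are illustrative; the real ASID-Caption schema may differ.
record = {
    "video": "example.mp4",
    "attributes": [
        {"attribute": "subject", "description": "a chef dicing onions", "verified": True},
        {"attribute": "audio", "description": "knife tapping a wooden board", "verified": True},
        {"attribute": "camera", "description": "static close-up shot", "verified": False},
    ],
}

# Keep only attributes that passed audiovisual verification.
verified = [a for a in record["attributes"] if a["verified"]]
print(json.dumps([a["attribute"] for a in verified]))
```

Filtering on a per-attribute verification flag is what lets training discard unverified claims instead of throwing away the whole caption.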
🎬 Captioning Case of ASID-Caption
🚀 Getting Started
1. Clone the repository
First, clone the project and navigate into the directory:

git clone https://github.com/HVision-NKU/ASID-Caption.git
cd ASID-Caption
2. Set Up the Environment
Requires Python 3.11 (assumed to be installed already).
2.1 Inference
Single Video Inference
Batch Video Inference
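The inference scripts themselves are not reproduced here; as a rough sketch, batch inference amounts to collecting video files and captioning each in turn. `caption_video` below is a hypothetical placeholder for the model call, not an actual API of this repository:

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mkv", ".mov", ".webm"}

def collect_videos(root: str) -> list:
    """Recursively gather video files under `root`, sorted for reproducibility."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in VIDEO_EXTS)

def caption_video(path: Path) -> str:
    """Placeholder for the actual model call (hypothetical, not the repo's API)."""
    return f"caption for {path.name}"

def run_batch(root: str) -> dict:
    """Map each video path to its generated caption."""
    return {str(p): caption_video(p) for p in collect_videos(root)}
```

Sorting the file list keeps batch runs deterministic, which makes result files easy to diff across model versions.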
2.2 Training
Stage 1-2
Stage 3
📈 Benchmark Evaluation
Audiovisual Caption
video-SALMONN2-testset
UGC-VideoCap
QA-based Audiovisual Caption
Daily-Omni
WorldSense
Visual-only Caption
VDC
VidCapBench-AE
Caption-based Temporal Grounding
Charades-STA
Attribute-based Instruction Following
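Charades-STA grounding is conventionally scored with temporal IoU between the predicted and ground-truth segments, typically reported as Recall@1 at IoU thresholds such as 0.3/0.5/0.7. A minimal sketch of the metric (the function names are ours, not this repo's evaluation code):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, thresh=0.5):
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```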
🔥 Results
We provide detailed quantitative results on different benchmarks and settings as shown below.
✒️ Citation
If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

@article{li2026towards,
  title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
  author={Li, Yunheng and Zhang, Hengrui and Guo, Meng-Hao and Gao, Wenzhao and Jia, Shaoyong and Jiao, Shaohui and Hou, Qibin and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2602.13013},
  year={2026}
}