
Improving LLM Video Understanding with 16 Frames Per Second

🚀🚀 Welcome to the repo of F-16!

F-16 is a powerful video large language model (LLM) that perceives high-frame-rate video, developed jointly by the Department of Electronic Engineering at Tsinghua University and ByteDance.

🔥 News

  • 2025-07-03: We release the final checkpoint of F-16.
  • 2025-06-18: We release the code of F-16.

⚡️ Future Plans

  • Release the code.
  • Release the final F-16 checkpoint.

🌈 How to Use

How to train a model

  1. Prepare the dataset following `scripts/example_sft.json`.
  2. Download the LLaVA-OneVision model from Hugging Face.
  3. Modify the parameters in `scripts/train_sft.sh`.
  4. Run `bash scripts/train_sft.sh`.
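The four steps above follow a common "edit parameters, then run" pattern. The sketch below illustrates it with a stand-in script; the variable names (`MODEL_PATH`, `DATA_PATH`) are assumptions for illustration only, since the real parameters are defined inside `scripts/train_sft.sh` in the repo.

```shell
# Illustrative stand-in for scripts/train_sft.sh; parameter names are hypothetical.
mkdir -p scripts
cat > scripts/train_sft_demo.sh <<'EOF'
#!/usr/bin/env bash
# Hypothetical parameters; check scripts/train_sft.sh for the real ones.
MODEL_PATH=/path/to/llava-onevision   # step 2: downloaded base model
DATA_PATH=scripts/example_sft.json    # step 1: SFT dataset
echo "training from $MODEL_PATH on $DATA_PATH"
EOF

# Step 3: point MODEL_PATH at the locally downloaded checkpoint.
sed -i 's|^MODEL_PATH=.*|MODEL_PATH=./llava-onevision|' scripts/train_sft_demo.sh

# Step 4: run the script.
bash scripts/train_sft_demo.sh
```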

How to evaluate a checkpoint

  1. Prepare the dataset following `scripts/example_sft.json`.
  2. Modify the parameters in `scripts/eval.sh`.
  3. Run `bash scripts/eval.sh`.
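Evaluation follows the same pattern. As a minimal sketch, the stand-in below shows one convenient way to set the checkpoint without editing the file each time, via environment-variable defaults; `CKPT_PATH` and `DATA_PATH` are hypothetical names, and this override convention is an assumption, not necessarily what `scripts/eval.sh` does.

```shell
# Illustrative stand-in for scripts/eval.sh; parameter names are hypothetical.
cat > eval_demo.sh <<'EOF'
#!/usr/bin/env bash
# Hypothetical parameters with env-var overrides (an assumed convention).
CKPT_PATH=${CKPT_PATH:-/path/to/f16-checkpoint}
DATA_PATH=${DATA_PATH:-scripts/example_sft.json}
echo "evaluating $CKPT_PATH on $DATA_PATH"
EOF

# Steps 2-3: choose the checkpoint to evaluate, then run.
CKPT_PATH=./f16-final bash eval_demo.sh
```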

👀 Team

Team Tsinghua: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Chao Zhang

Team ByteDance: Wei Li, Zejun Ma

✨ Citation

If you find F-16 useful, please cite the paper:

@inproceedings{li2025improving,
  title={Improving LLM Video Understanding with 16 Frames Per Second},
  author={Li, Yixuan and Tang, Changli and Zhuang, Jimin and Yang, Yudong and Sun, Guangzhi and Li, Wei and Ma, Zejun and Zhang, Chao},
  booktitle={Proc. ICML},
  year={2025}, 
  address={Vancouver}
}