
Improving LLM Video Understanding with 16 Frames Per Second

🚀🚀 Welcome to the repo of F-16!

F-16 is a powerful video large language model (LLM) that perceives high-frame-rate video, developed jointly by the Department of Electronic Engineering at Tsinghua University and ByteDance.

🔥 News

  • 2025-07-03: We release the final checkpoint of F-16.
  • 2025-06-18: We release the code of F-16.

⚡️ Future Plans

  • Release the code.
  • Release the final F-16 checkpoint.

🌈 How to Use

How to train a model

  1. Prepare the dataset following `scripts/example_sft.json`.
  2. Download the LLaVA-OneVision model from Hugging Face.
  3. Modify the parameters in `scripts/train_sft.sh`.
  4. Run `bash scripts/train_sft.sh`.
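The four steps above follow a common "edit parameters, then run" pattern. The sketch below illustrates it with a stand-in script; the variable names (`MODEL_PATH`, `DATA_PATH`) are assumptions for illustration only, since the real parameters are defined inside `scripts/train_sft.sh` in the repo.

```shell
# Illustrative stand-in for scripts/train_sft.sh; parameter names are hypothetical.
mkdir -p scripts
cat > scripts/train_sft_demo.sh <<'EOF'
#!/usr/bin/env bash
# Hypothetical parameters; check scripts/train_sft.sh for the real ones.
MODEL_PATH=/path/to/llava-onevision   # step 2: downloaded base model
DATA_PATH=scripts/example_sft.json    # step 1: SFT dataset
echo "training from $MODEL_PATH on $DATA_PATH"
EOF

# Step 3: point MODEL_PATH at the locally downloaded checkpoint.
sed -i 's|^MODEL_PATH=.*|MODEL_PATH=./llava-onevision|' scripts/train_sft_demo.sh

# Step 4: run the script.
bash scripts/train_sft_demo.sh
```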

How to evaluate a checkpoint

  1. Prepare the dataset following `scripts/example_sft.json`.
  2. Modify the parameters in `scripts/eval.sh`.
  3. Run `bash scripts/eval.sh`.
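Evaluation follows the same pattern. As a minimal sketch, the stand-in below shows one convenient way to set the checkpoint without editing the file each time, via environment-variable defaults; `CKPT_PATH` and `DATA_PATH` are hypothetical names, and this override convention is an assumption, not necessarily what `scripts/eval.sh` does.

```shell
# Illustrative stand-in for scripts/eval.sh; parameter names are hypothetical.
cat > eval_demo.sh <<'EOF'
#!/usr/bin/env bash
# Hypothetical parameters with env-var overrides (an assumed convention).
CKPT_PATH=${CKPT_PATH:-/path/to/f16-checkpoint}
DATA_PATH=${DATA_PATH:-scripts/example_sft.json}
echo "evaluating $CKPT_PATH on $DATA_PATH"
EOF

# Steps 2-3: choose the checkpoint to evaluate, then run.
CKPT_PATH=./f16-final bash eval_demo.sh
```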

👀 Team

Team Tsinghua: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Chao Zhang

Team ByteDance: Wei Li, Zejun Ma

✨ Citation

If you find F-16 useful, please cite the paper:

@inproceedings{li2025improving,
  title={Improving LLM Video Understanding with 16 Frames Per Second},
  author={Li, Yixuan and Tang, Changli and Zhuang, Jimin and Yang, Yudong and Sun, Guangzhi and Li, Wei and Ma, Zejun and Zhang, Chao},
  booktitle={Proc. ICML},
  year={2025}, 
  address={Vancouver}
}