We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query. The second release evolves toward a foundation model with state-of-the-art spatio-temporal grounding (STG) and temporal retrieval capability while maintaining basic open-ended video QA performance.
Release
[01/20/2026] 🔥 Vidi2.5 released with an updated report, GitHub repo, and demo. The VUE_PLOT benchmark and Vidi1.5-9B weights with finetuning code are included.
The demo has been updated at https://vidi.byteintl.com/. It has two pages: the Vidi base page and the Vidi-Edit page.
Vidi
Select a mode from [“Grounding”, “Retrieval”, “Chapter”, “Highlight”, “VQA”, “Thinking”] on the segmented button. Please use English queries for the best experience.
“Grounding”: Input a text query indicating the object to be searched. The model will find the clips corresponding to the text query, with bounding boxes on the object.
“Retrieval”: Input a text query to be searched. The model will find the clips corresponding to the text query.
“Chapter”: No input query needed. The model directly outputs a set of chapters with titles.
“Highlight”: No input query needed. The model directly outputs a set of highlight clips with titles.
“VQA”: Input a question/instruction about the video. The model will answer the question.
“Thinking”: Input a question/instruction about the video. The model will think and then answer the question.
Click the “Upload” button and select a local video file (mp4 format). Make sure the video is not corrupted and the resolution is not too high; 480p is recommended for fast uploading and decoding.
After the video is uploaded, wait until the upload finishes and the video is ready to play in the box.
Enter the text query if needed, then click the “Send” button.
Wait until the result clips appear in the chat box. This can take several minutes for long videos.
Vidi-Edit
Select the “Edit” page. Upload multiple videos and click the “Generate” button. It will automatically output an edited video with a storyline, music, effects, etc.
Evaluation (VUE-STG)
We release the video IDs, ground-truth annotations, and evaluation results in CSV files. Follow the instructions in VUE_STG/README.md to conduct the evaluation.
Evaluation (VUE-TR-V2)
We release the ground-truth annotations and evaluation results in 5 JSON files. Run the script for a standalone evaluation:
cd VUE_TR_V2
bash install.sh
python3 -u qa_eval.py --pred_path results_Vidi.json
The result figures will be saved in the output folder ('./results' by default).
For evaluation of new models, first download the videos based on the IDs in VUE_TR_V2/video_id.txt from YouTube (e.g., with yt-dlp). Then run inference and save the results in the following format:
[
  {
    "query_id": 0,
    "video_id": "6Qv-LrXJjSM",
    "duration": 3884.049,
    "query": "The slide showcases Taco Bell's purple ang pow for Chinese New Year, while a woman explains that purple symbolizes royalty in the Chinese tradition.",
    "answer": [
      [
        913.1399199,
        953.5340295
      ]
    ],
    "task": "temporal_retrieval"
  },
  ...
]
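As a sanity check before running qa_eval.py, the prediction file can be written with a small helper like the one below. This is a minimal sketch: the field names follow the format shown above, but `write_predictions` and the validation rules are illustrative assumptions, not part of the Vidi codebase.

```python
import json

# Required fields for each prediction entry, per the format above.
REQUIRED_KEYS = {"query_id", "video_id", "duration", "query", "answer", "task"}

def write_predictions(entries, path):
    """Validate prediction entries and dump them as a JSON list."""
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {entry.get('query_id')} missing {missing}")
        # "answer" holds [start, end] time ranges in seconds within the video.
        for start, end in entry["answer"]:
            if not 0 <= start <= end <= entry["duration"]:
                raise ValueError(f"bad range [{start}, {end}]")
    with open(path, "w") as f:
        json.dump(entries, f, indent=2)

# Sample entry matching the format above.
sample = [{
    "query_id": 0,
    "video_id": "6Qv-LrXJjSM",
    "duration": 3884.049,
    "query": "The slide showcases Taco Bell's purple ang pow for Chinese New Year.",
    "answer": [[913.1399199, 953.5340295]],
    "task": "temporal_retrieval",
}]
write_predictions(sample, "results_mymodel.json")
```

The resulting file can then be passed to the evaluation script via `--pred_path`.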
You may find the instructions and data for the previous version (VUE-TR) here.
Evaluation (VUE-PLOT)
We release the VUE-PLOT benchmark for plot understanding with two tracks: character and reasoning. Follow the instructions in VUE_PLOT/readme.md to conduct the evaluation.
To evaluate your own model:
You can obtain the raw videos either by using the YouTube video IDs or by downloading them from the Condensed Movies dataset homepage.
Generate the results with your own model and follow the instructions in VUE_PLOT/readme.md to finish the evaluation.
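The download step can be scripted. The sketch below only builds yt-dlp command lines from a list of YouTube IDs; the output pattern is an assumption, and actually executing each command (e.g., with subprocess.run) is left out since it needs network access.

```python
from pathlib import Path

def yt_dlp_command(video_id, out_dir="videos"):
    """Build a yt-dlp command line for one YouTube video ID."""
    url = f"https://www.youtube.com/watch?v={video_id}"
    # -f mp4 prefers an mp4 format; -o names the file after the video ID.
    return ["yt-dlp", "-f", "mp4", "-o", f"{out_dir}/%(id)s.%(ext)s", url]

def commands_from_file(id_file):
    """Read one video ID per line and build a command for each."""
    ids = Path(id_file).read_text().split()
    return [yt_dlp_command(v) for v in ids]

# Example with the sample video ID from the VUE-TR-V2 format above.
print(" ".join(yt_dlp_command("6Qv-LrXJjSM")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once network access is available.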
Vidi2.5: Large Multimodal Models for Video Understanding and Creation
Homepage: https://bytedance.github.io/vidi-website/
Model Inference and Finetune
To conduct inference and finetuning for Vidi1.5-9B, follow the instructions in Vidi1.5_9B/README.md.
To conduct inference for Vidi-7B, follow the instructions in Vidi_7B/README.md.
Citation
If you find Vidi useful for your research and applications, please cite using this BibTeX: