MMLONGCITE: A Benchmark for Evaluating Faithfulness of Long-Context Vision-Language Models
📢 News
[May, 2026] Major update to MMLongCite: we extend the benchmark up to 128K tokens and refine the task taxonomy for more comprehensive faithfulness evaluation.
[October, 2025] Code and data of MMLongCite are now publicly available.
📝 Citation
If you find our work helpful, please cite our paper:
@article{zhou2025mmlongcite,
title={MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models},
author={Zhou, Keyan and Tang, Zecheng and Ming, Lingfeng and Zhou, Guanghao and Chen, Qiguang and Qiao, Dan and Yang, Zheming and Qin, Libo and Qiu, Minghui and Li, Juntao and others},
journal={arXiv preprint arXiv:2510.13276},
year={2025}
}
🔍 Benchmark Overview
MMLongCite is a benchmark for evaluating the faithfulness of long-context vision-language models (LCVLMs) through multimodal citation generation.
The benchmark contains 2,280 examples across 8 tasks, covering image-only, image-text interleaved, and video-only contexts. Context lengths span from 8K to 128K tokens.
We also introduce MMLongCite-HR, a high-resolution setting that evaluates fine-grained visual grounding in dense stitched-image inputs. MMLongCite-HR provides two modes: the easy mode stitches 4 images into a single large image with an average resolution of 1K–2K, while the hard mode stitches 16 images into a single large image with an average resolution of 2K–4K.
Figure 1: Task format in MMLongCite.
Figure 2: Statistics of tasks in MMLongCite.
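To make the stitched-image setting of MMLongCite-HR more concrete, here is a minimal sketch of how 4 or 16 equally sized tiles could be laid out on one canvas. It assumes a square, row-major grid (2×2 and 4×4) and a hypothetical tile size of 512 pixels; the function name `grid_offsets` is illustrative, and the benchmark's actual stitching procedure may differ.

```python
import math


def grid_offsets(n_images: int, tile_w: int, tile_h: int):
    """Return (canvas_size, offsets) for a square, row-major grid of tiles.

    Assumption: all tiles share the same size and the grid is square.
    """
    side = math.isqrt(n_images)
    if side * side != n_images:
        raise ValueError("this sketch only handles square grids (4, 16, ...)")
    canvas = (side * tile_w, side * tile_h)
    # Top-left paste position of tile i, filling rows left to right.
    offsets = [((i % side) * tile_w, (i // side) * tile_h) for i in range(n_images)]
    return canvas, offsets


# Hypothetical 512x512 tiles for the two modes.
easy_canvas, easy_offsets = grid_offsets(4, 512, 512)    # 2x2 grid
hard_canvas, hard_offsets = grid_offsets(16, 512, 512)   # 4x4 grid
print(easy_canvas, hard_canvas)
```

With these assumed tile sizes the easy canvas is 1024×1024 and the hard canvas is 2048×2048; each offset is the position at which an image library would paste the corresponding tile.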
⚙️ Preparation
Environment
Make sure you are in this project folder and then run:
Data Preparation
Download the MMLongCite data from 🤗 Hugging Face. Once downloaded, place the data in the root directory of the repository.
The folder structure is organized as follows:
All data in MMLongCite follows the format below:
id: A unique identifier for the data sample.
context: A list containing all the contextual information (e.g., images, text) needed to answer the question.
question: A list containing the specific question to be answered, which may include text and multiple-choice options.
ground_truth: The correct answer for the question.
meta: A dictionary containing additional information for each case, including:
  text_length: The length of text content within the context.
  mm_length: The length of multi-modal content within the context.
  evidence_ids: A list of position identifiers indicating where the supporting evidence is located within the long context.
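The field list above can be turned into a quick sanity check before running inference. The sketch below is illustrative only: the helper `missing_fields` and the sample record are hypothetical, and the values are placeholders rather than real benchmark data.

```python
from typing import Any

# Field names taken from the format description above.
TOP_LEVEL_FIELDS = ("id", "context", "question", "ground_truth", "meta")
META_FIELDS = ("text_length", "mm_length", "evidence_ids")


def missing_fields(sample: dict[str, Any]) -> list[str]:
    """List the documented fields that are absent from one data sample."""
    missing = [name for name in TOP_LEVEL_FIELDS if name not in sample]
    meta = sample.get("meta")
    # Sub-fields are only checked when "meta" is present and is a dictionary.
    if isinstance(meta, dict):
        missing += [f"meta.{name}" for name in META_FIELDS if name not in meta]
    return missing


# Hypothetical record with placeholder values (not taken from the dataset).
sample = {
    "id": "example-0001",
    "context": ["<image-1>", "some interleaved text", "<image-2>"],
    "question": ["Which image shows the red car?", "A. Image 1", "B. Image 2"],
    "ground_truth": "B",
    "meta": {"text_length": 120, "mm_length": 8000, "evidence_ids": [2]},
}

print(missing_fields(sample))
```

An empty list means every documented field is present; the check says nothing about whether the values themselves are well formed.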
Here is an example:
🤖️ Inference & Evaluation
Inference
We provide a vLLM-based inference script in src/infer_vllm.py.
Run inference on the main MMLongCite benchmark:
Run inference on MMLongCite-HR:
For models with thinking mode enabled:
Prediction files are saved to:
You can also refer to the example script:
Evaluation
MMLongCite evaluates both citation quality and answer correctness. We use GPT-5.2 as the judge model, and the evaluation scripts support passing in multiple API keys to accelerate evaluation.
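As an illustration of why multiple API keys can speed up judging, the sketch below assigns examples to keys in round-robin order and runs one worker thread per key. It is not the repository's implementation: `judge_stub` and `evaluate_with_keys` are hypothetical names, and the stub makes no network call.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def judge_stub(example_id: str, api_key: str) -> dict:
    """Hypothetical stand-in for one judge-model request (no network call)."""
    return {"id": example_id, "key_used": api_key}


def evaluate_with_keys(example_ids, api_keys, judge=judge_stub):
    """Judge all examples, rotating through the given keys in round-robin order."""
    if not api_keys:
        raise ValueError("at least one API key is required")
    # Pair each example with the next key; cycle() restarts when keys run out.
    assignments = list(zip(example_ids, cycle(api_keys)))
    # One worker per key, so requests under different keys can overlap.
    with ThreadPoolExecutor(max_workers=len(api_keys)) as pool:
        # map() returns results in the same order as the input examples.
        return list(pool.map(lambda pair: judge(*pair), assignments))


results = evaluate_with_keys(["a", "b", "c", "d", "e"], ["KEY_1", "KEY_2"])
for row in results:
    print(row)
```

Replacing `judge_stub` with a real request function would keep the same structure; rate limits and retries are left out of this sketch.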
Citation Evaluation
This produces:
Correctness Evaluation
This produces:
Among the outputs, citation_result.json and correctness_result.json store per-case metrics for every example, while citation_score.json and correctness_score.json store the overall aggregated metrics for each task. Example evaluation commands are provided in:
bash script/eval.sh
🏷️ License
All code within this repository is released under the Apache License 2.0.