
AncientDoc: Benchmarking Vision-Language Models on Chinese Ancient Documents

Paper | Dataset | Models

📖 Introduction

Chinese ancient documents are invaluable carriers of history and culture, but their visual complexity and linguistic variety, together with the lack of dedicated benchmarks, make them challenging for modern Vision-Language Models (VLMs).
We introduce AncientDoc, the first benchmark designed for evaluating VLMs on Chinese ancient documents, covering the full pipeline from OCR to knowledge reasoning.

  • 5 Tasks: Page-level OCR, Vernacular Translation, Reasoning-based QA, Knowledge-based QA, Linguistic Variant QA
  • 14 Categories: 100+ books, ~3,000 pages across dynasties from Warring States to Qing
  • Rich Annotations: OCR + semantic translation + multi-level QA pairs
  • Comprehensive Evaluation: CER, Precision/Recall/F1, CHRF++, BERTScore, and human-aligned GPT-4o scoring


🏛 Dataset Overview

  • Source: Digitized ancient documents from Harvard Library and others
  • Dynasty Coverage: from the Warring States period through the Han, Tang, Song, and Ming dynasties to the Qing
  • Category Coverage: 14 semantic categories (e.g., collected works, Chuci-style poetry, medicine, astronomy, literary criticism, art)
  • Total Size: ~3,000 page images, with annotations across five tasks


🧩 Task Definition

  1. Page-level OCR – extract the complete text in the correct reading order (vertical, right-to-left columns, including annotations).
  2. Vernacular Translation – translate classical Chinese into modern vernacular.
  3. Reasoning-based QA – infer implicit meanings, causality, and ideology.
  4. Knowledge-based QA – answer factual and cultural questions from texts.
  5. Linguistic Variant QA – recognize rhetorical devices, stylistic features, and literary styles.
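The reading-order requirement in task 1 can be illustrated with a hypothetical sketch: classical pages are read in right-to-left columns, top-to-bottom within each column. The box format `(x, y, text)` and the `col_gap` threshold below are illustrative assumptions, not part of the benchmark.

```python
# Hypothetical sketch: restore vertical right-to-left reading order from
# detected text boxes given as (x, y, text) tuples. The col_gap threshold
# for splitting columns is an assumed parameter.
def reading_order(boxes, col_gap=30):
    """Group boxes into columns right-to-left, then read top-to-bottom."""
    # Sort by x descending so the rightmost column comes first.
    boxes = sorted(boxes, key=lambda b: -b[0])
    columns, current = [], []
    for box in boxes:
        # A large horizontal jump signals the start of a new column.
        if current and current[-1][0] - box[0] > col_gap:
            columns.append(current)
            current = []
        current.append(box)
    if current:
        columns.append(current)
    # Within each column, read top-to-bottom (ascending y).
    return "".join(t for col in columns
                   for _, _, t in sorted(col, key=lambda b: b[1]))

# Two columns: right column "夫天", left column "之所".
boxes = [(100, 0, "夫"), (100, 20, "天"), (60, 0, "之"), (60, 20, "所")]
print(reading_order(boxes))  # → 夫天之所
```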

📊 Evaluation Metrics

  • OCR Task: CER, Char Precision/Recall/F1
  • Translation & QA Tasks: CHRF++, BERTScore (BS-F1)
  • LLM-as-a-Judge: GPT-4o scoring aligned with human ratings
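The CER metric above can be sketched in a few lines of stdlib Python, assuming the standard formulation (character-level Levenshtein distance divided by reference length, reported as a percentage); the benchmark's exact normalization may differ.

```python
# Minimal sketch of character error rate (CER), assuming the standard
# definition: edit distance over characters / reference length, in percent.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return 100.0 * prev[n] / max(m, 1)

print(round(cer("夫天之所", "夫地之所"), 2))  # one substitution in four chars → 25.0
```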

🚀 Baseline Results

We evaluate open-source (Qwen2.5-VL, InternVL, LLaVA, etc.) and closed-source (GPT-4o, Gemini2.5-Pro, Doubao-V2, etc.) VLMs.

  • OCR: Gemini2.5-Pro achieves the lowest CER (32.03)
  • Translation: Gemini2.5-Pro leads with a BS-F1 of 72.5
  • Reasoning QA: Qwen2.5-VL-72B shows the strongest implicit reasoning
  • Knowledge QA: GPT-4o achieves the best factual-QA performance
  • Variant QA: GPT-4o and Gemini2.5-Pro excel at stylistic recognition

---

Data Format

Each JSONL file contains one record per line, for example:

```json
{
  "image": "class/book/page_001.png",
  "task": "OCR",
  "question": "Please extract the text...",
  "answer": "夫天之所..."
}
```
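Records in this format can be loaded with a small helper; the field names follow the example above, and the file path passed in is whatever JSONL file you downloaded.

```python
# Minimal sketch of loading an AncientDoc-style JSONL annotation file:
# one JSON object per line, with "image", "task", "question", "answer" fields.
import json

def load_annotations(path):
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(json.loads(line))
    return samples
```

Filtering by task is then a one-liner, e.g. `[s for s in samples if s["task"] == "OCR"]`.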

📌 Citation

If you use AncientDoc in your research, please cite:

@article{yu2025benchmarking,
  title={Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning},
  author={Haiyang Yu and Yuchuan Wu and Fan Shi and Lei Liao and Jinghui Lu and Xiaodong Ge and Han Wang and Minghan Zhuo and Xuecheng Wu and Xiang Fei and Hao Feng and Guozhi Tang and An-Lan Wang and Hanshen Zhu and Yangfan He and Quanhuan Liang and Liyuan Meng and Chao Feng and Can Huang and Jingqun Tang and Bin Li},
  journal={arXiv preprint arXiv:2509.09731},
  year={2025}
}

🔗 Resources

Data License

The AncientDoc dataset is released under the CC0 license.
