CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution
Baoliang Tian1*, Yuxuan Si1,2*, Jilong Wang1,3*, Lingyao Li1, Zhongyuan Bao1, Zineng Zhou1,
Tao Wang1†, Sixu Li1, Ziyao Xu1, Mingze Wang1, Zhouzhuo Zhang1, Zhihao Wang1,
Yike Yun1, Ke Tian1, Ning Yang3†, Minghui Qiu1
1ByteDance 2Zhejiang University 3Institute of Automation, Chinese Academy of Sciences *Equal Contribution
AAAI 2026 (Oral)
🚀 Introduction
🔥 We will open-source the full CrossCheck-Bench dataset, benchmark suite, and evaluation toolkit. Stay tuned!
Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning and perception abilities. However, their compositional robustness under conflicting multimodal signals remains underexplored. Real-world scenarios frequently present contradictions between text and images, requiring models to identify the reliable modality or resolve the inconsistency.
CrossCheck-Bench is introduced to systematically diagnose compositional failures in MLLMs under multimodal conflicts. The benchmark consists of:
Structured multimodal conflict categories
Compositional reasoning tasks under contradictory cues
Human-verified conflict annotations
Robust evaluation protocol and metrics
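The evaluation protocol and metrics above will ship with the toolkit. Purely as a sketch of the kind of aggregation such a protocol implies, the snippet below computes overall and per-conflict-type exact-match accuracy from model predictions; the record keys (conflict_type, label, prediction) are placeholders chosen for illustration, not the released API.

```python
from collections import defaultdict

def accuracy_by_conflict_type(records):
    """Aggregate exact-match accuracy overall and per conflict type.

    `records` is an iterable of dicts with hypothetical keys
    "conflict_type", "label" (ground truth), and "prediction".
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["conflict_type"]] += 1
        if r["prediction"] == r["label"]:
            correct[r["conflict_type"]] += 1
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_type

# Example with two toy records (illustrative values only):
overall, per_type = accuracy_by_conflict_type([
    {"conflict_type": "attribute", "label": "image", "prediction": "image"},
    {"conflict_type": "logical", "label": "text", "prediction": "image"},
])
print(overall, per_type)  # 0.5 {'attribute': 1.0, 'logical': 0.0}
```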
Our experiments reveal significant failure modes across state-of-the-art MLLMs, including:
Over-reliance on textual cues
Incorrect visual grounding
Multi-hop reasoning breakdowns
Failure on conflict-sensitive attributes
CrossCheck-Bench provides the first comprehensive diagnostic tool for understanding these weaknesses.
📊 Benchmark Details
📝 Dataset Overview
CrossCheck-Bench includes diverse multimodal conflict scenarios, covering:
Attribute conflicts
Logical inconsistencies
Text vs. image contradictions
Spatial and relational conflicts
Multi-entity compositional conflicts
Instruction override conflicts
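As a rough illustration only, the six categories could be encoded as an enum like the one below; the class and value names are placeholders and may differ from the released annotation schema.

```python
from enum import Enum

class ConflictType(str, Enum):
    """Hypothetical labels for the six categories listed above; the
    released annotation schema may use different names."""
    ATTRIBUTE = "attribute"                        # e.g. caption says "red car", image shows a blue one
    LOGICAL = "logical"                            # logically inconsistent claims across modalities
    TEXT_IMAGE = "text_image"                      # direct contradiction between the text and the image
    SPATIAL_RELATIONAL = "spatial_relational"      # conflicting spatial or relational descriptions
    MULTI_ENTITY = "multi_entity"                  # compositional conflicts spanning several entities
    INSTRUCTION_OVERRIDE = "instruction_override"  # an instruction that contradicts the visual evidence
```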
Each sample contains:
A conflicting multimodal input (image + text)
Metadata describing the conflict type
A ground-truth resolution label
An optional reasoning trace
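To make the per-sample structure concrete, here is a minimal sketch of what a record might look like; every field name is an assumption based on the list above, not the official data format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrossCheckSample:
    """Hypothetical per-sample record mirroring the fields listed above."""
    image_path: str                         # image half of the conflicting input
    text: str                               # textual half of the conflicting input
    conflict_type: str                      # metadata on the conflict type, e.g. "attribute"
    resolution_label: str                   # ground-truth resolution of the conflict
    reasoning_trace: Optional[str] = None   # optional reasoning trace

# Illustrative instantiation (values are made up):
sample = CrossCheckSample(
    image_path="images/000123.jpg",
    text="The caption claims the cup on the table is empty.",
    conflict_type="attribute",
    resolution_label="image",               # e.g. the image is the reliable modality here
)
```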
🔧 Construction Pipeline
✨ Pipeline
The benchmark is constructed via a multi-stage pipeline.
🛠️ Usage
🔥 Code is coming soon. Stay tuned!
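Until the toolkit is released there is no supported API; the snippet below is only a guess at what loading and scoring could look like, assuming a simple JSON-lines release with the hypothetical fields sketched in the Dataset Overview.

```python
import json

def load_crosscheck(jsonl_path):
    """Load CrossCheck-Bench samples from a (hypothetical) JSONL release."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def evaluate(samples, predict_fn):
    """Score a model callable that maps a sample dict to a resolution string."""
    hits = sum(predict_fn(s) == s["resolution_label"] for s in samples)
    return hits / max(len(samples), 1)

# Example (uncomment once the data file exists; the path and the
# text-only baseline below are placeholders):
# samples = load_crosscheck("crosscheck_bench.jsonl")
# print(evaluate(samples, lambda s: "text"))
```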