
CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution

Baoliang Tian1*, Yuxuan Si1,2*, Jilong Wang1,3*, Lingyao Li1, Zhongyuan Bao1, Zineng Zhou1,
Tao Wang1†, Sixu Li1, Ziyao Xu1, Mingze Wang1, Zhouzhuo Zhang1, Zhihao Wang1,
Yike Yun1, Ke Tian1, Ning Yang3†, Minghui Qiu1


1ByteDance 2Zhejiang University 3Institute of Automation, Chinese Academy of Sciences *Equal Contribution


arXiv Dataset Project Page GitHub

AAAI 2026 (Oral)


🚀 Introduction

🔥 We will open-source the full CrossCheck-Bench dataset, benchmark suite, and evaluation toolkit. Stay tuned!

Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning and perception abilities. However, their compositional robustness under conflicting multimodal signals remains underexplored. Real-world scenarios frequently present contradictions between text and images, requiring models to identify the trustworthy modality or resolve the inconsistency.

We introduce CrossCheck-Bench to systematically diagnose compositional failures in MLLMs under multimodal conflicts. The benchmark consists of:

  • Structured multimodal conflict categories
  • Compositional reasoning tasks under contradictory cues
  • Human-verified conflict annotations
  • Robust evaluation protocol and metrics

Our experiments reveal significant failure modes across state-of-the-art MLLMs, including:

  • Over-reliance on textual cues
  • Incorrect visual grounding
  • Multi-hop reasoning breakdowns
  • Failure on conflict-sensitive attributes

CrossCheck-Bench provides the first comprehensive diagnostic tool for understanding these weaknesses.


📊 Benchmark Details

📝 Dataset Overview

CrossCheck-Bench includes diverse multimodal conflict scenarios, covering:

  • Attribute conflicts
  • Logical inconsistencies
  • Text–image contradictions
  • Spatial and relational conflicts
  • Multi-entity compositional conflicts
  • Instruction override conflicts

Each sample contains:

  • A conflicting multimodal input (image + text)
  • Metadata on the conflict type
  • Ground-truth resolution label
  • Reasoning trace (optional)
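The fields above might be represented as a record like the following. This is a sketch only: the concrete field names, values, and conflict-type labels are assumptions, since the dataset schema has not yet been released.

```python
# Hypothetical CrossCheck-Bench sample record. Field names and the
# conflict-type vocabulary are assumptions, not the released schema.
sample = {
    "image": "images/kitchen_0421.jpg",                    # image input
    "text": "A red mug sits to the left of the laptop.",   # conflicting caption
    "conflict_type": "spatial",                            # conflict metadata
    "resolution": "image",                                 # ground-truth label
    "reasoning_trace": [                                   # optional trace
        "The caption places the mug left of the laptop.",
        "In the image, the mug is right of the laptop.",
        "The image is the authoritative modality here.",
    ],
}

# Basic sanity checks a data loader might perform.
required = {"image", "text", "conflict_type", "resolution"}
assert required <= sample.keys()
assert sample["conflict_type"] in {
    "attribute", "logical", "text_image", "spatial",
    "multi_entity", "instruction_override",
}
```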

🔧 Construction Pipeline


The benchmark is constructed via a multi-stage pipeline:

  1. Template-based conflict generation
  2. LLM-assisted conflict mutation
  3. Human verification
  4. Consistency filtering
  5. Compositional augmentation
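The five stages above can be sketched as a chain of transformations over a pool of candidate samples. Every function here is a hypothetical placeholder (the mutation, verification, and augmentation steps are stubbed), intended only to show the data flow, not the real pipeline.

```python
# Illustrative sketch of the five-stage construction pipeline.
# All functions are hypothetical placeholders, not the released code.

def generate_from_templates(n):
    """Stage 1: template-based conflict generation."""
    return [{"id": str(i), "text": f"sample {i}", "verified": False} for i in range(n)]

def mutate_with_llm(samples):
    """Stage 2: LLM-assisted conflict mutation (stubbed)."""
    return [dict(s, text=s["text"] + " [mutated]") for s in samples]

def human_verify(samples):
    """Stage 3: human verification of candidate conflicts (stubbed)."""
    return [dict(s, verified=True) for s in samples]

def filter_consistent(samples):
    """Stage 4: consistency filtering; drop unverified samples."""
    return [s for s in samples if s["verified"]]

def augment_compositionally(samples):
    """Stage 5: combine adjacent conflicts into compositional ones (stubbed)."""
    combos = [
        {"id": a["id"] + "+" + b["id"],
         "text": a["text"] + " & " + b["text"],
         "verified": True}
        for a, b in zip(samples, samples[1:])
    ]
    return samples + combos

pool = generate_from_templates(3)
for stage in (mutate_with_llm, human_verify, filter_consistent, augment_compositionally):
    pool = stage(pool)
print(len(pool))  # 3 base samples + 2 compositional combinations -> 5
```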

🛠️ Usage

🔥 Code is coming soon. Stay tuned!
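Until the official toolkit is released, a minimal evaluation loop might look like the sketch below. The prediction format (one predicted resolution label per sample) and the per-conflict-type accuracy breakdown are assumptions, not the official protocol.

```python
from collections import defaultdict

# Minimal evaluation sketch: overall and per-conflict-type accuracy.
# The sample/prediction format is an assumption; the official
# CrossCheck-Bench evaluation toolkit has not been released yet.

def evaluate(samples, predictions):
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        ctype = sample["conflict_type"]
        total[ctype] += 1
        if pred == sample["resolution"]:
            correct[ctype] += 1
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_type

# Toy run on three hand-made samples.
samples = [
    {"conflict_type": "attribute", "resolution": "image"},
    {"conflict_type": "attribute", "resolution": "text"},
    {"conflict_type": "spatial",   "resolution": "image"},
]
overall, per_type = evaluate(samples, ["image", "image", "image"])
print(overall, per_type)  # 2/3 correct overall; attribute 0.5, spatial 1.0
```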
