WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild

📑 Paper | 🤗 Dataset | 🐙 GitHub

This repository contains the evaluation code for the paper "WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild".


🔔 Introduction

WildSpeech Overview

WildSpeech-Bench is the first benchmark for evaluating the speech-to-speech (S2S) capabilities of SpeechLLMs, distinguished by both its evaluation framework and its construction process.

🪝 Construction

WildSpeech Overview

Our benchmark construction process directly counters the limitations of current datasets, resulting in a curated collection of 1,100 queries organized into five major categories. Each category reflects a common user intent, facilitating granular analysis and ensuring comprehensive coverage of real-world demands on SpeechLLMs. The process not only filters meticulously for queries characteristic of spoken interaction but also includes a crucial subsequent phase of manual auditing, in which every selected query is validated by human experts to ensure its quality and relevance.

Our evaluation framework significantly improves the precision of LLM-based judging for S2S interactions. Moving beyond generic rubrics that often overlook critical nuances, we strategically employ unique evaluation prompts for challenging queries. Crucially, these are not generic templates but meticulously hand-crafted checklists, each manually authored and fine-tuned by our team to highlight a specific query’s characteristics and potential pitfalls.

🏆 Leaderboard

Main evaluation results. TC, II, SR, OE, and PF stand for Text Creation, Information Inquiry, Solution Request, Opinion Exchange, and Paralinguistic-Featured queries, respectively.

| Model | TC | II | SR | OE | PF | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Naive Pipeline | 5.55 | 4.98 | 5.51 | 5.18 | 4.84 | 5.24 |
| Kimi-Audio | 4.45 | 4.33 | 4.79 | 4.70 | 4.92 | 4.54 |
| GLM-4-Voice | 5.16 | 4.77 | 5.41 | 5.04 | 4.51 | 5.03 |
| MiniCPM | 5.17 | 4.89 | 5.28 | 5.31 | 4.78 | 5.08 |
| Qwen-2.5-omni | 5.98 | 5.84 | 6.66 | 6.16 | 4.46 | 6.01 |
| GPT-4o-Audio | 6.74 | 6.06 | 6.39 | 6.32 | 6.01 | 6.29 |

We encourage you to submit new results directly through the issue tracker. The leaderboard will be updated accordingly.

⚙️ Installation

  1. Clone the repository
  2. Set up the environment:

```shell
conda create -n wildspeech python=3.10
conda activate wildspeech
pip install -r requirements.txt
```

📝 Usage

Basic Command

```shell
bash scripts/evaluate.sh <model> <step>
```

Parameters

  • model: Name of the model to evaluate
    • Supported models: qwen2p5-omni, naive-qwen, minicpm, baichuan-audio, baichuan-omni, kimi-audio, etc.
  • step: Evaluation step to execute (1-3)
    • 1: Generate audio and transcriptions
    • 2: Evaluate transcription quality using GPT
    • 3: Analyze and summarize results
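The three steps above run in sequence: step 1's audio and transcriptions feed step 2's GPT judging, whose scores feed step 3's summary. A minimal sketch of chaining them (the script path and model name mirror the basic command above; running the commands for real is left commented out):

```python
# Sketch: chain the three evaluation steps for one model.
# The command shape mirrors: bash scripts/evaluate.sh <model> <step>
import subprocess  # needed if you uncomment the run call below

MODEL = "qwen2p5-omni"  # any supported model name

def step_command(step: int) -> list[str]:
    """Build the shell command for one evaluation step (1-3)."""
    return ["bash", "scripts/evaluate.sh", MODEL, str(step)]

# Step 1: generate, step 2: GPT judge, step 3: analyze.
for step in (1, 2, 3):
    print("running:", " ".join(step_command(step)))
    # subprocess.run(step_command(step), check=True)  # uncomment to execute
```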

Examples

Run the audio generation step (step 1) for the qwen2p5-omni model:

```shell
bash scripts/evaluate.sh qwen2p5-omni 1
```

Run only the gpt-4o-mini judge step (step 2):

```shell
bash scripts/evaluate.sh qwen2p5-omni 2
```

Run only the results analysis step (step 3):

```shell
bash scripts/evaluate.sh qwen2p5-omni 3
```

🔦 Citation

```bibtex
@misc{zhang2025wildspeechbenchbenchmarkingendtoendspeechllms,
      title={WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild},
      author={Linhao Zhang and Jian Zhang and Bokai Lei and Chuhan Wu and Aiwei Liu and Wei Jia and Xiao Zhou},
      year={2025},
      eprint={2506.21875},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}
```

📜 License

See the License.txt file for details.

💐 Thanks
