WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild
📑 Paper | 🤗 Dataset | 🐙 GitHub
This repository contains the evaluation code for the paper "WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild".
🔔 Introduction
WildSpeech-Bench is the first benchmark for evaluating the speech-to-speech (S2S) capabilities of SpeechLLMs, distinguished by both its evaluation framework and its construction process.
🪝 Construction
Our benchmark construction process directly counters the limitations of current datasets, resulting in a curated collection of 1,100 queries organized into five major categories. Each category reflects a common user intent, facilitating granular analysis and ensuring comprehensive coverage of real-world demands on SpeechLLMs. Construction involved not only meticulous filtering for queries characteristic of spoken interaction but also a crucial subsequent phase of manual auditing, in which every selected query was validated by human experts for quality and relevance.
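To make the filtering idea concrete, here is a minimal sketch of a pre-filter for speech-suitable queries. The patterns below are hypothetical illustrations, not the benchmark's actual filtering rules: the idea is simply that queries relying on visual artifacts (code blocks, URLs, tables) do not translate to spoken interaction and can be screened out before manual auditing.

```python
import re

# Hypothetical pre-filter (a sketch, not WildSpeech-Bench's actual rules):
# reject queries that depend on visual formatting a user could not speak aloud.
_VISUAL_PATTERNS = [
    re.compile(r"`{3}"),        # fenced code blocks
    re.compile(r"https?://"),   # URLs the user would have to read out
    re.compile(r"\|.+\|"),      # markdown-style tables
]

def looks_speech_suitable(query: str) -> bool:
    """Return True if the query could plausibly be asked out loud."""
    return not any(p.search(query) for p in _VISUAL_PATTERNS)
```

Queries passing such a heuristic would still go through the manual audit described above; the heuristic only narrows the candidate pool.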
Our evaluation framework significantly improves the precision of LLM-based judging for S2S interactions. Moving beyond generic rubrics that often overlook critical nuances, we employ a dedicated evaluation prompt for each challenging query. Crucially, these are not templated: each is a hand-crafted checklist, manually authored and fine-tuned by our team to highlight that query's characteristics and potential pitfalls.
🏆 Leaderboard
Main evaluation results. TC, II, SR, OE, and PF stand for Text Creation, Information Inquiry, Solution Request, Opinion Exchange, and Paralinguistic-Featured queries, respectively.
| Model | TC | II | SR | OE | PF | Avg. |
|---|---|---|---|---|---|---|
| Naive Pipeline | 5.55 | 4.98 | 5.51 | 5.18 | 4.84 | 5.24 |
| Kimi-Audio | 4.45 | 4.33 | 4.79 | 4.70 | 4.92 | 4.54 |
| GLM-4-Voice | 5.16 | 4.77 | 5.41 | 5.04 | 4.51 | 5.03 |
| MiniCPM | 5.17 | 4.89 | 5.28 | 5.31 | 4.78 | 5.08 |
| Qwen-2.5-omni | 5.98 | 5.84 | 6.66 | 6.16 | 4.46 | 6.01 |
| GPT-4o-Audio | 6.74 | 6.06 | 6.39 | 6.32 | 6.01 | 6.29 |
We encourage you to submit new results directly through the issue tracker; the leaderboard will be updated accordingly.
⚙️ Installation
📝 Usage
Basic Command
bash scripts/evaluate.sh <model> <step>
Parameters
model: Name of the model to evaluate. Supported models: qwen2p5-omni, naive-qwen, minicpm, baichuan-audio, baichuan-omni, kimi-audio, etc.
step: Evaluation step to execute (1-3)
1: Generate audio and transcriptions
2: Evaluate transcription quality using GPT
3: Analyze and summarize results
Examples
Evaluate all steps for the qwen2p5-omni model:
bash scripts/evaluate.sh qwen2p5-omni 1
Run only the gpt-4o-mini judge step:
bash scripts/evaluate.sh qwen2p5-omni 2
Run only the results analysis step:
bash scripts/evaluate.sh qwen2p5-omni 3
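The three-step pipeline can be sketched as a small dispatcher; the function names and bodies below are hypothetical placeholders that only mirror the step numbering documented above, not the repository's actual implementation.

```python
# Hypothetical sketch of the evaluate.sh step dispatch. The stage functions
# are placeholders standing in for the repo's real stages.

def generate_audio_and_transcriptions(model: str) -> str:
    return f"step 1 done for {model}"  # placeholder for audio generation + transcription

def judge_transcriptions_with_gpt(model: str) -> str:
    return f"step 2 done for {model}"  # placeholder for the GPT judge pass

def summarize_results(model: str) -> str:
    return f"step 3 done for {model}"  # placeholder for score aggregation

STEPS = {
    1: generate_audio_and_transcriptions,
    2: judge_transcriptions_with_gpt,
    3: summarize_results,
}

def run_step(model: str, step: int) -> str:
    """Run a single evaluation step (1-3) for the given model."""
    if step not in STEPS:
        raise ValueError(f"step must be 1-3, got {step}")
    return STEPS[step](model)
```

Each step consumes the previous step's outputs, which is why the script exposes them individually: the expensive generation and judging passes can be rerun independently of the cheap final analysis.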
🔦 Citation
@misc{zhang2025wildspeechbenchbenchmarkingendtoendspeechllms,
title={WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild},
author={Linhao Zhang and Jian Zhang and Bokai Lei and Chuhan Wu and Aiwei Liu and Wei Jia and Xiao Zhou},
year={2025},
eprint={2506.21875},
archivePrefix={arXiv},
primaryClass={cs.CL},
}
📜 License
See the License.txt file for details.
💐 Thanks