VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Overview
VStyle is a bilingual (Chinese & English) benchmark for voice style adaptation. It covers four key tasks:
Acoustic attribute control
Natural language instruction following
Role-playing
Implicit empathy
To enable automated and reproducible evaluation, we introduce the LALM-as-a-Judge framework, which assesses model outputs across three dimensions (see the sketch after this list):
Textual faithfulness (Is it saying the right thing?)
Style adherence (Does it match the intended style?)
Naturalness (Does it sound smooth and natural?)
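To make the rubric concrete, here is a purely illustrative sketch of a single judge verdict in Python. The field names and the 1-5 scale are assumptions for illustration, not the benchmark's actual output format.

# Hypothetical shape of one LALM-as-a-Judge verdict for a single response.
# Field names and the 1-5 scale are illustrative assumptions.
verdict = {
    "sample_id": "en_0001",
    "textual_faithfulness": 5,  # is it saying the right thing?
    "style_adherence": 4,       # does it match the intended style?
    "naturalness": 4,           # does it sound smooth and natural?
}

# A simple unweighted average is one possible aggregate.
dims = ("textual_faithfulness", "style_adherence", "naturalness")
print(sum(verdict[d] for d in dims) / len(dims))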
VStyle goes beyond checking correctness: it evaluates how well the model speaks. Experiments on a range of open-source and commercial systems show that it effectively differentiates models' voice style adaptation abilities.
Leaderboard
Figure: Evaluation results of different SLMs.
We evaluate three proprietary systems: GPT-4o Audio (snapshot: gpt-4o-audio-preview-2025-06-03), GPT-4o-Mini Audio (snapshot: gpt-4o-mini-audio-preview-2024-12-17), and Doubao. Additionally, we include four open-source end-to-end speech language models with strong speech generation performance: Step-Audio, Kimi-Audio, Baichuan-Audio, and Qwen-2.5 Omni.
Figure: Evaluation results of different SLMs across different task types.
Evaluate your model
We provide a Gemini API–based evaluation tool for assessing voice synthesis quality across multiple dimensions. It automatically processes audio samples, generates scores, and produces comprehensive analysis reports.
Quick Example:
# Install dependencies
pip install google-generativeai matplotlib pandas tqdm
# Run evaluation on example data
python lalm_eval/gemini_eval.py \
--root_dir ./data/examples/model_res/en/wav \
--metadata_path ./data/examples/model_res/en/metadata.jsonl \
--out_dir ./data/examples/eval_res/en \
--gemini_api_key YOUR_API_KEY
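The evaluator writes per-sample scores to a results file (the Contributing section below refers to it as metadata_with_score.jsonl). As a minimal sketch, assuming that file is JSON Lines with one record per sample and hypothetical per-dimension score fields, the scores can be summarized with pandas (already installed above):

import pandas as pd

# Load the scored metadata; the path and score field names are assumptions for illustration.
df = pd.read_json("./data/examples/eval_res/en/metadata_with_score.jsonl", lines=True)

score_cols = ["textual_faithfulness", "style_adherence", "naturalness"]  # hypothetical names
print(df[score_cols].describe())     # per-dimension score distribution
print(df[score_cols].mean().mean())  # overall average across dimensions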
For detailed usage instructions, see: lalm_eval/README.md.
For inference results of other models reported in our paper, please refer to the dataset at https://huggingface.co/datasets/zhanjun/VStyle-responses.
Human-Model Correlation Analysis
We reproduce the paper's correlation study between human annotations and LALM-as-a-Judge scores, which validates the reliability of the automated evaluation.
Quick Example:
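As a minimal sketch of such a check, assuming a results file that pairs a human score and a judge score per sample (the column names here are hypothetical, not the repository's actual schema), a rank correlation can be computed with scipy (pip install scipy):

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical input: one row per sample with paired human and judge scores.
df = pd.read_json("./data/examples/eval_res/en/metadata_with_score.jsonl", lines=True)

# Column names are assumptions for illustration.
rho, p = spearmanr(df["human_score"], df["lalm_score"])
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")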
For detailed analysis instructions, see: human_align/README.md.
Contributing
To submit your evaluation results to VStyle, please send the results file (metadata_with_score.jsonl) to jzhan24@m.fudan.edu.cn.
License
This project is licensed under the MIT License.