VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Overview
VStyle is a bilingual (Chinese & English) benchmark for voice style adaptation. It covers four key tasks:
Acoustic attribute control
Natural language instruction following
Role-playing
Implicit empathy
To enable automated and reproducible evaluation, we introduce the LALM-as-a-Judge framework, which assesses model outputs across three dimensions (see the sketch after this list):
Textual faithfulness (Is it saying the right thing?)
Style adherence (Does it match the intended style?)
Naturalness (Does it sound smooth and natural?)
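To make the rubric concrete, here is a purely illustrative sketch of a single judge verdict in Python. The field names and the 1-5 scale are assumptions for illustration, not the benchmark's actual output format.

# Hypothetical shape of one LALM-as-a-Judge verdict for a single response.
# Field names and the 1-5 scale are illustrative assumptions.
verdict = {
    "sample_id": "en_0001",
    "textual_faithfulness": 5,  # is it saying the right thing?
    "style_adherence": 4,       # does it match the intended style?
    "naturalness": 4,           # does it sound smooth and natural?
}

# A simple unweighted average is one possible aggregate.
dims = ("textual_faithfulness", "style_adherence", "naturalness")
print(sum(verdict[d] for d in dims) / len(dims))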
VStyle goes beyond checking correctness: it evaluates how well the model speaks. Experiments on a range of open-source and commercial systems show that it effectively differentiates models' voice style adaptation abilities.
Leaderboard
Figure: Evaluation results of different SLMs.
We evaluate three proprietary systems: GPT-4o Audio (snapshot: gpt-4o-audio-preview-2025-06-03), GPT-4o-Mini Audio (snapshot: gpt-4o-mini-audio-preview-2024-12-17), and Doubao. Additionally, we include four open-source end-to-end speech language models with strong speech generation performance: Step-Audio, Kimi-Audio, Baichuan-Audio, and Qwen-2.5 Omni.
Figure: Evaluation results of different SLMs across different task types.
Evaluate your model
We provide a Gemini API–based evaluation tool for assessing voice synthesis quality across multiple dimensions. It automatically processes audio samples, generates scores, and produces comprehensive analysis reports.
Quick Example:
# Install dependencies
pip install google-generativeai matplotlib pandas tqdm
# Run evaluation on example data
python lalm_eval/gemini_eval.py \
--root_dir ./data/examples/model_res/en/wav \
--metadata_path ./data/examples/model_res/en/metadata.jsonl \
--out_dir ./data/examples/eval_res/en \
--gemini_api_key YOUR_API_KEY
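The evaluator writes per-sample scores to a results file (the Contributing section below refers to it as metadata_with_score.jsonl). As a minimal sketch, assuming that file is JSON Lines with one record per sample and hypothetical per-dimension score fields, the scores can be summarized with pandas (already installed above):

import pandas as pd

# Load the scored metadata; the path and score field names are assumptions for illustration.
df = pd.read_json("./data/examples/eval_res/en/metadata_with_score.jsonl", lines=True)

score_cols = ["textual_faithfulness", "style_adherence", "naturalness"]  # hypothetical names
print(df[score_cols].describe())     # per-dimension score distribution
print(df[score_cols].mean().mean())  # overall average across dimensions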
For detailed usage instructions, see: lalm_eval/README.md.
For inference results of other models reported in our paper, please refer to the dataset at https://huggingface.co/datasets/zhanjun/VStyle-responses.
Human-Model Correlation Analysis
We reproduce the paper's correlation study between human annotations and LALM-as-a-Judge scores, which validates the reliability of the automated evaluation.
Quick Example:
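As a minimal sketch of such a check, assuming a results file that pairs a human score and a judge score per sample (the column names here are hypothetical, not the repository's actual schema), a rank correlation can be computed with scipy (pip install scipy):

import pandas as pd
from scipy.stats import spearmanr

# Hypothetical input: one row per sample with paired human and judge scores.
df = pd.read_json("./data/examples/eval_res/en/metadata_with_score.jsonl", lines=True)

# Column names are assumptions for illustration.
rho, p = spearmanr(df["human_score"], df["lalm_score"])
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")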
For detailed analysis instructions, see: human_align/README.md.
Contributing
To submit your evaluation results to VStyle, please send the results file (metadata_with_score.jsonl) to jzhan24@m.fudan.edu.cn.
License
This project is licensed under the MIT License.