VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Overview

VStyle is a bilingual (Chinese & English) benchmark for voice style adaptation. It covers four key tasks:

  • Acoustic attribute control
  • Natural language instruction following
  • Role-playing
  • Implicit empathy

To enable automated and reproducible evaluation, we introduce the LALM-as-a-Judge framework, which assesses model outputs across three dimensions:

  • Textual faithfulness (Is it saying the right thing?)
  • Style adherence (Does it match the intended style?)
  • Naturalness (Does it sound smooth and natural?)
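A judge pipeline along these lines produces one score per dimension for each response; the sketch below shows one way those per-dimension scores might be combined into an overall number. The field names and the averaging rule are illustrative assumptions, not the benchmark's exact schema.

```python
# Minimal sketch of aggregating per-dimension judge scores for one response.
# Dimension names mirror the three axes above; equal-weight averaging is an
# assumption made for illustration.

def overall_score(scores: dict) -> float:
    """Average the three LALM-as-a-Judge dimensions for one response."""
    dims = ("textual_faithfulness", "style_adherence", "naturalness")
    return sum(scores[d] for d in dims) / len(dims)

# Hypothetical judged record on a 1-5 scale.
judged = {"textual_faithfulness": 5, "style_adherence": 4, "naturalness": 4}
print(overall_score(judged))
```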

VStyle goes beyond checking correctness: it evaluates how well the model speaks. Experiments on a range of open-source and commercial systems show that the benchmark effectively differentiates models' voice style adaptation abilities.

Leaderboard

  • Evaluation results of spoken language models (SLMs) across the four task types.

Evaluate your model

We provide a Gemini API–based evaluation tool for assessing voice synthesis quality across multiple dimensions. It automatically processes audio samples, generates scores, and produces comprehensive analysis reports.

Quick Example:

# Install dependencies
pip install google-generativeai matplotlib pandas tqdm

# Run evaluation on example data
python lalm_eval/gemini_eval.py \
    --root_dir ./data/examples/model_res/en/wav \
    --metadata_path ./data/examples/model_res/en/metadata.jsonl \
    --out_dir ./data/examples/eval_res/en \
    --gemini_api_key YOUR_API_KEY
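The evaluator is driven by a JSON-lines metadata file, one record per line, alongside a directory of WAV files. As a rough illustration of how such a file can be walked, the sketch below pairs each record with its audio path; the `wav` key is a hypothetical field name, so consult lalm_eval/README.md for the actual schema.

```python
import json
from pathlib import Path

def load_metadata(metadata_path: str, root_dir: str):
    """Yield (wav_path, record) pairs from a JSON-lines metadata file.

    Assumes each record stores a relative audio filename under a
    hypothetical "wav" key; adjust to the real schema in lalm_eval/README.md.
    """
    root = Path(root_dir)
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            yield root / record["wav"], record
```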

For detailed usage instructions, see: lalm_eval/README.md.

For inference results of other models reported in our paper, please refer to the dataset at https://huggingface.co/datasets/zhanjun/VStyle-responses.

Human-Model Correlation Analysis

We reproduce the correlation study between human annotations and LALM-as-a-Judge reported in the paper, which validates the reliability of the automated evaluation.

Quick Example:

# Download evaluation results of all seven models
huggingface-cli download --repo-type dataset --local-dir-use-symlinks False zhanjun/VStyle-eval-results --local-dir VStyle-eval-results

# Compute Spearman correlations
python human_align/compute_model_human_spearman_r.py

For detailed analysis instructions, see: human_align/README.md
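The reproduction script computes Spearman's rank correlation between human and judge scores. For readers unfamiliar with the statistic, here is a self-contained pure-Python sketch (the project's own script is the authoritative implementation; the scores below are made up for illustration):

```python
def rankdata(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend over the tied block
        avg = (i + j) / 2 + 1  # average rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative (not real) human vs. judge scores for five responses.
human = [4, 5, 3, 2, 5]
judge = [3.8, 4.6, 3.1, 2.2, 4.9]
print(spearman_r(human, judge))
```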

Contributing

To submit your evaluation results to VStyle, please send the results file (metadata_with_score.jsonl) to jzhan24@m.fudan.edu.cn.
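Before emailing results, a quick sanity check that the file is well-formed JSON Lines can save a round trip. The required fields are not documented here, so this sketch only validates that every line parses:

```python
import json

def check_jsonl(path: str) -> int:
    """Return the number of records in a JSON-lines file.

    Raises ValueError on the first malformed line; blank lines are tolerated.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"line {lineno} is not valid JSON: {e}") from e
            count += 1
    return count
```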

License

This project is licensed under the MIT License.
