Fun-CosyVoice 3.0 is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
Key Features
Language Coverage: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multilingual and cross-lingual zero-shot voice cloning.
Content Consistency & Naturalness: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
Pronunciation Inpainting: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, which gives more control over the output and makes the model more suitable for production use.
Text Normalization: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
Bi-Streaming: Supports both text-in streaming and audio-out streaming, and achieves latency as low as 150ms while maintaining high-quality audio output.
Instruct Support: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
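To make the text-in half of bi-streaming concrete, here is a minimal sketch of how an upstream text stream (for example, fragments arriving from an LLM) could be cut into sentence-sized pieces before synthesis. It is not taken from the CosyVoice code base: the function name `stream_sentences` and the punctuation set are assumptions made for illustration, and whether a given inference function accepts an iterator like this should be checked in example.py.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe flush point (an assumption for this sketch).
SENTENCE_END = set("。！？.!?")


def stream_sentences(fragments: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they can be cut from the stream."""
    buffer = ""
    for fragment in fragments:
        for char in fragment:
            buffer += char
            if char in SENTENCE_END:
                sentence = buffer.strip()
                if sentence:
                    yield sentence
                buffer = ""
    # Flush whatever is left once the upstream source is exhausted.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    # Hypothetical fragments, split at arbitrary positions like LLM tokens.
    llm_output = ["Hel", "lo the", "re. How a", "re you", "? Fine"]
    for piece in stream_sentences(llm_output):
        print(piece)
```

Because the function is a generator, downstream synthesis can start on the first sentence while later text is still being produced. The splitting rule is deliberately naive: it would also cut inside a number such as "3.14", so a real frontend would need a more careful rule.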
Roadmap
2025/12
release Fun-CosyVoice3-0.5B-2512 base model, rl model and its training/inference script
release Fun-CosyVoice3-0.5B modelscope gradio space
2025/08
add triton trtllm runtime support and cosyvoice2 grpo training support, thanks to the contribution from Yuekai Zhang (NVIDIA)
2025/07
release Fun-CosyVoice 3.0 eval set
2025/05
add CosyVoice2-0.5B vllm support
2024/12
25hz CosyVoice2-0.5B released
2024/09
25hz CosyVoice-300M base model
25hz CosyVoice-300M voice conversion function
2024/08
Repetition Aware Sampling (RAS) inference for LLM stability
Streaming inference mode support, including KV cache and SDPA for RTF optimization
2024/07
Flow matching training support
WeTextProcessing support when ttsfrd is not available
Fastapi server and client
Evaluation
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh SS (%) ↑ | test-en WER (%) ↓ | test-en SS (%) ↑ | test-hard CER (%) ↓ | test-hard SS (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
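For readers unfamiliar with the metrics in the evaluation results, the sketch below shows how character error rate (CER), word error rate (WER), and a similarity score are commonly computed. It is an illustration, not the evaluation script behind these numbers. The text normalization is deliberately simplistic, and treating SS as a cosine similarity between two speaker embeddings scaled to percent is an assumption, as are all function names.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ref_text: str, hyp_text: str) -> float:
    """Character error rate; spaces are ignored (a simplification)."""
    return error_rate(list(ref_text.replace(" ", "")),
                      list(hyp_text.replace(" ", "")))


def wer(ref_text: str, hyp_text: str) -> float:
    """Word error rate on lowercased, whitespace-split words."""
    return error_rate(ref_text.lower().split(), hyp_text.lower().split())


def speaker_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors, in percent (assumed SS)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return 100.0 * dot / norm
```

In practice the hypothesis text comes from running a speech recognizer on the synthesized audio, and the embeddings come from a speaker verification model; neither step is shown here.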
Install
Clone and install
Clone the repo
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If cloning the submodules fails because of network errors, rerun the following commands until they succeed
cd CosyVoice
git submodule update --init --recursive
We strongly recommend that you download our pretrained Fun-CosyVoice3-0.5B, CosyVoice2-0.5B, CosyVoice-300M, CosyVoice-300M-SFT, and CosyVoice-300M-Instruct models and the CosyVoice-ttsfrd resource.
# modelscope SDK model download
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
# for overseas users, huggingface SDK model download
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('FunAudioLLM/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('FunAudioLLM/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
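The two download snippets differ only in the SDK and in the organization prefix of most repository IDs, so they can be folded into one helper. The sketch below is one possible way to do that. The helper names (`repo_id`, `download_plan`, `download_all`) are hypothetical, while the repository IDs and local directories are copied from the commands above. The SDK import is deferred so that only the chosen package needs to be installed.

```python
from typing import List, Tuple

# Local directory names used under pretrained_models/ in the commands above.
MODELS = [
    "Fun-CosyVoice3-0.5B",
    "CosyVoice2-0.5B",
    "CosyVoice-300M",
    "CosyVoice-300M-SFT",
    "CosyVoice-300M-Instruct",
    "CosyVoice-ttsfrd",
]


def repo_id(name: str, hub: str) -> str:
    """Map a local model name to its repository ID on the given hub."""
    if name == "Fun-CosyVoice3-0.5B":
        # Same ID on both hubs, with a release suffix.
        return "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"
    org = "iic" if hub == "modelscope" else "FunAudioLLM"
    return f"{org}/{name}"


def download_plan(hub: str) -> List[Tuple[str, str]]:
    """Return (repo_id, local_dir) pairs without touching the network."""
    if hub not in ("modelscope", "huggingface"):
        raise ValueError(f"unknown hub: {hub!r}")
    return [(repo_id(name, hub), f"pretrained_models/{name}") for name in MODELS]


def download_all(hub: str) -> None:
    """Download every model in the plan using the SDK of the chosen hub."""
    plan = download_plan(hub)  # validates `hub` before importing an SDK
    if hub == "modelscope":
        from modelscope import snapshot_download
    else:
        from huggingface_hub import snapshot_download
    for rid, local_dir in plan:
        snapshot_download(rid, local_dir=local_dir)


if __name__ == "__main__":
    # Print the plan only; call download_all("modelscope") to fetch the files.
    for rid, local_dir in download_plan("modelscope"):
        print(rid, "->", local_dir)
```

Keeping the plan separate from the download call makes it easy to check what would be fetched, or to trim `MODELS` to the single model you need, before any large transfer starts.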
Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.
Note that this step is not required. If you do not install the ttsfrd package, wetext is used by default.
We strongly recommend using Fun-CosyVoice3-0.5B for better performance.
Follow the code in example.py for detailed usage of each model.
python example.py
vLLM Usage
CosyVoice2/3 now support vLLM 0.11.x+ (V1 engine) and vLLM 0.9.0 (legacy).
Older vLLM versions (<0.9.0) do not support CosyVoice inference, and versions in between (e.g., 0.10.x) are not tested.
Note that vLLM has many specific requirements. Consider creating a new environment for it, so that your existing environment is not corrupted if your hardware turns out not to support vLLM.
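The version rules above can be written down as a small check to run before attempting vLLM inference. This is a sketch, not part of the repository: the function names and the four status labels are made up for illustration, and only the leading major.minor.patch numbers of the version string are considered.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '0.11.0.post1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def vllm_support_status(version: str) -> str:
    """Classify a vLLM version according to the rules stated above."""
    parsed = parse_version(version)
    if parsed < (0, 9, 0):
        return "unsupported"   # older than 0.9.0
    if parsed == (0, 9, 0):
        return "legacy"        # the one supported legacy release
    if parsed >= (0, 11, 0):
        return "v1"            # 0.11.x and later, V1 engine
    return "untested"          # anything in between, e.g. 0.10.x


if __name__ == "__main__":
    for candidate in ["0.8.5", "0.9.0", "0.10.2", "0.11.0"]:
        print(candidate, vllm_support_status(candidate))
```

The installed version can be read with `importlib.metadata.version("vllm")` from the standard library and passed to `vllm_support_status`.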
@article{du2024cosyvoice,
title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
journal={arXiv preprint arXiv:2407.05407},
year={2024}
}
@article{du2024cosyvoice2,
title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
journal={arXiv preprint arXiv:2412.10117},
year={2024}
}
@article{du2025cosyvoice,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
journal={arXiv preprint arXiv:2505.17589},
year={2025}
}
@inproceedings{lyu2025build,
title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--2},
year={2025},
organization={IEEE}
}
Disclaimer
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
👉🏻 CosyVoice 👈🏻
Fun-CosyVoice 3.0: Demos; Paper; Modelscope; Huggingface; CV3-Eval
CosyVoice 2.0: Demos; Paper; Modelscope; HuggingFace
CosyVoice 1.0: Demos; Paper; Modelscope; HuggingFace
Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
Create Conda env:
Start web demo
You can use our web demo page to get familiar with CosyVoice quickly.
Please see the demo website for details.
Advanced Usage
For advanced users, we provide training and inference scripts in examples/libritts.
Build for deployment
Optionally, if you want to deploy CosyVoice as a service, you can run the following steps.
Using Nvidia TensorRT-LLM for deployment
Using TensorRT-LLM to accelerate the CosyVoice2 LLM can give a 4x speedup compared with the Hugging Face Transformers implementation. To get started quickly:
For more details, you could check here
Discussion & Communication
You can discuss directly on GitHub Issues.
You can also scan the QR code to join our official Dingding chat group.
Acknowledge
Citations