[!Note]
This repo contains the algorithm infrastructure and some simple examples.
[!Tip]
For extended end-user products, please refer to the community-maintained index repo Awesome-ChatTTS. You can find a diagram visualization of the codebase here.
ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants.
Conversational TTS: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
Fine-grained Control: The model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
Better Prosody: ChatTTS surpasses most open-source TTS models in terms of prosody. We provide pretrained models to support further research and development.
Dataset & Model
[!Important]
The released model is for academic purposes only.
The main model is trained on 100,000+ hours of Chinese and English audio data.
The open-source version on HuggingFace is a model pre-trained on 40,000 hours of data, without SFT.
Roadmap
Open-source the 40k-hours-base model and spk_stats file.
Streaming audio generation.
Open-source the DVAE encoder and zero-shot inference code.
Multi-emotion control.
ChatTTS.cpp (a new repo in the 2noise org is welcome)
Licenses
The Code
The code is published under the AGPLv3+ license.
The model
The model is published under the CC BY-NC 4.0 license. It is intended for educational and research use, and should not be used for any commercial or illegal purposes. The authors do not guarantee the accuracy, completeness, or reliability of the information. The information and data used in this repo are for academic and research purposes only. The data was obtained from publicly available sources, and the authors do not claim any ownership or copyright over the data.
Disclaimer
ChatTTS is a powerful text-to-speech system. However, it is very important to utilize this technology responsibly and ethically. To limit the use of ChatTTS, we added a small amount of high-frequency noise during the training of the 40,000-hour model, and compressed the audio quality as much as possible using MP3 format, to prevent malicious actors from potentially using it for criminal purposes. At the same time, we have internally trained a detection model and plan to open-source it in the future.
Contact
GitHub issues and PRs are always welcome.
Formal Inquiries
For formal inquiries about the model and roadmap, please contact us at open-source@2noise.com.
Unrecommended Optional: Install TransformerEngine if using NVIDIA GPU (Linux only)
[!Warning]
DO NOT INSTALL!
The adaptation of TransformerEngine is currently under development and CANNOT run properly now.
Install it only for development purposes. See #672 and #676 for more details.
Unrecommended Optional: Install FlashAttention-2 (mainly NVIDIA GPU)
[!Warning]
DO NOT INSTALL!
According to a reported issue, FlashAttention-2 currently slows down generation.
Install it only for development purposes.
Make sure you are in the project root directory when you run the commands below.
1. Launch WebUI
python examples/web/webui.py
2. Infer by Command Line
It will save audio to ./output_audio_n.mp3
python examples/cmd/run.py "Your text 1." "Your text 2."
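The command-line example can also be driven from a script. The following is a minimal sketch, assuming the script path `examples/cmd/run.py` shown above and that it is run from the project root. The helper names `build_run_command` and `run_cli` are hypothetical and not part of ChatTTS.

```python
import subprocess
import sys
from pathlib import Path

# Script path taken from the command above; relative to the project root.
RUN_SCRIPT = Path("examples/cmd/run.py")


def build_run_command(texts):
    """Return the argv list for the CLI example (hypothetical helper)."""
    if not texts:
        raise ValueError("at least one text is required")
    # Each text is passed as its own positional argument, as in the command above.
    return [sys.executable, str(RUN_SCRIPT), *texts]


def run_cli(texts):
    """Run the CLI example and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_run_command(texts), check=True)
```

Calling `run_cli(["Your text 1.", "Your text 2."])` from the project root should be equivalent to the shell command above. Passing a list to `subprocess.run` avoids shell quoting problems when a text contains quotes or spaces.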
Installation
Install the stable version from PyPI
pip install ChatTTS
Install the latest version from GitHub
pip install git+https://github.com/2noise/ChatTTS
Install from local directory in dev mode
pip install -e .
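To confirm which of the installs above is active, the installed distribution can be queried with the standard library. This is a small sketch that assumes the distribution is registered under the name `ChatTTS` (the name used in the `pip install` command above). `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(package="ChatTTS"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string, or None when ChatTTS is not installed.
print(installed_version())
```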
Basic Usage
import ChatTTS
import torch
import torchaudio
chat = ChatTTS.Chat()
chat.load(compile=False) # Set to True for better performance
texts = ["PUT YOUR 1st TEXT HERE", "PUT YOUR 2nd TEXT HERE"]
wavs = chat.infer(texts)
for i in range(len(wavs)):
    # Depending on the torchaudio version, either the first call or the second one works.
    try:
        torchaudio.save(f"basic_output{i}.wav", torch.from_numpy(wavs[i]).unsqueeze(0), 24000)
    except Exception:
        torchaudio.save(f"basic_output{i}.wav", torch.from_numpy(wavs[i]), 24000)
Advanced Usage
###################################
# Sample a speaker from Gaussian.
rand_spk = chat.sample_random_speaker()
print(rand_spk) # save it for later timbre recovery
params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb = rand_spk, # add sampled speaker
    temperature = .3,   # using custom temperature
    top_P = 0.7,        # top P decode
    top_K = 20,         # top K decode
)
###################################
# For sentence level manual control.
# use oral_(0-9), laugh_(0-2), break_(0-7)
# to generate special token in text to synthesize.
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_2][laugh_0][break_6]',
)
wavs = chat.infer(
    texts,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
###################################
# For word level manual control.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wavs = chat.infer(text, skip_refine_text=True, params_refine_text=params_refine_text, params_infer_code=params_infer_code)
# Depending on the torchaudio version, either the first call or the second one works.
try:
    torchaudio.save("word_level_output.wav", torch.from_numpy(wavs[0]).unsqueeze(0), 24000)
except Exception:
    torchaudio.save("word_level_output.wav", torch.from_numpy(wavs[0]), 24000)
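The sentence-level prompt above is a plain string of bracketed tags, so it can be assembled and range-checked before it is passed to `RefineTextParams`. The sketch below is illustrative only: `build_refine_prompt` is a hypothetical helper, the ranges are copied from the comment above (`oral_(0-9)`, `laugh_(0-2)`, `break_(0-7)`), and whether a control may be omitted from the prompt is an assumption.

```python
# Upper bounds copied from the comment above: oral_(0-9), laugh_(0-2), break_(0-7).
PROMPT_RANGES = {"oral": 9, "laugh": 2, "break": 7}


def build_refine_prompt(oral=None, laugh=None, break_=None):
    """Build a prompt such as '[oral_2][laugh_0][break_6]' (hypothetical helper)."""
    parts = []
    for name, level in (("oral", oral), ("laugh", laugh), ("break", break_)):
        if level is None:
            continue  # assumption: a control can simply be left out
        upper = PROMPT_RANGES[name]
        if not isinstance(level, int) or not 0 <= level <= upper:
            raise ValueError(f"{name} must be an int in 0..{upper}, got {level!r}")
        parts.append(f"[{name}_{level}]")
    return "".join(parts)


prompt = build_refine_prompt(oral=2, laugh=0, break_=6)
print(prompt)  # [oral_2][laugh_0][break_6]
```

The resulting string can then be passed as `prompt=` to `ChatTTS.Chat.RefineTextParams`, as in the example above.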
Example: self introduction
inputs_en = """
chat T T S is a text to speech model designed for dialogue applications.
[uv_break]it supports mixed language input [uv_break]and offers multi speaker
capabilities with precise control over prosodic elements like
[uv_break]laughter[uv_break][laugh], [uv_break]pauses, [uv_break]and intonation.
[uv_break]it delivers natural and expressive speech,[uv_break]so please
[uv_break] use the project responsibly at your own risk.[uv_break]
""".replace('\n', '') # English is still experimental.
params_refine_text = ChatTTS.Chat.RefineTextParams(
prompt='[oral_2][laugh_0][break_4]',
)
audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)
torchaudio.save("self_introduction_output.wav", torch.from_numpy(audio_array_en[0]), 24000)
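Note that `.replace('\n', '')` joins lines without any separator, so a line that ends in a word and is followed by a line that starts with a word will be fused unless the first line keeps a trailing space. A hedged alternative is to collapse all whitespace into single spaces. `flatten_text` is a hypothetical helper, and whether the extra spaces around control tokens affect synthesis has not been verified here.

```python
def flatten_text(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())


sample = """
offers multi speaker
capabilities with precise control
[uv_break]over pauses.
"""

print(flatten_text(sample))
# offers multi speaker capabilities with precise control [uv_break]over pauses.
```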
1. How much VRAM do I need? What about inference speed?
For a 30-second audio clip, at least 4GB of GPU memory is required. On a 4090 GPU, it can generate audio corresponding to approximately 7 semantic tokens per second. The Real-Time Factor (RTF) is around 0.3.
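As a rough worked example, assume the usual definition of RTF as generation time divided by the duration of the generated audio (the README does not define it). With an RTF of 0.3, a 30-second clip would take about 9 seconds to generate. `estimated_generation_seconds` is a hypothetical helper for this arithmetic.

```python
def estimated_generation_seconds(audio_seconds, rtf=0.3):
    """Estimate wall-clock generation time as audio duration times RTF."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


print(estimated_generation_seconds(30))  # 9.0
```

Actual timings depend on hardware, text length, and settings, so treat the result as an estimate only.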
2. Model stability is not good enough, with issues such as multiple speakers or poor audio quality.
This is a problem that typically occurs with autoregressive models (for example, bark and valle). It’s generally difficult to avoid. One can try multiple samples to find a suitable result.
3. Besides laughter, can we control anything else? Can we control other emotions?
In the current released model, the only token-level control units are [laugh], [uv_break], and [lbreak]. In future versions, we may open-source models with additional emotional control capabilities.
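Since only three token-level tags are listed, it can help to check input text for other bracketed tags before synthesis. The sketch below is a plain-Python check and not part of ChatTTS: `unsupported_tokens` is a hypothetical helper, and the regular expression only looks at lowercase tags without digits, so sentence-level prompt tags such as `[oral_2]` are deliberately ignored.

```python
import re

# Token-level controls listed above for the current released model.
SUPPORTED_TOKENS = {"[laugh]", "[uv_break]", "[lbreak]"}

# Matches lowercase bracketed tags such as "[laugh]" or "[sigh]".
_TAG = re.compile(r"\[[a-z_]+\]")


def unsupported_tokens(text):
    """Return bracketed tags in `text` that are not in SUPPORTED_TOKENS, in order."""
    return [tag for tag in _TAG.findall(text) if tag not in SUPPORTED_TOKENS]


text = "What is [uv_break]your favorite english food?[laugh][lbreak]"
print(unsupported_tokens(text))                   # []
print(unsupported_tokens("Oh no [sigh][laugh]"))  # ['[sigh]']
```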
Acknowledgements
bark, XTTSv2 and valle demonstrate remarkable TTS results with autoregressive-style systems.
fish-speech reveals the capability of GVQ as an audio tokenizer for LLM modeling.
ChatTTS
A generative speech model for daily dialogue.
English | 简体中文 | 日本語 | Русский | Español | Français | 한국어
Introduction
Supported Languages
Highlights
Contact
Online Chat
1. QQ Group (Chinese Social APP)
2. Discord Server
Join by clicking here.
Get Started
Clone Repo
Install requirements
1. Install Directly
2. Install from conda
Optional: Install vLLM (Linux only)
Unrecommended Optional: Install FlashAttention-2 (mainly NVIDIA GPU)
Quick Start
FAQ
Special Appreciation
Thanks to all contributors for their efforts