Existing semantic speech tokenizers suffer from critical instability in noisy environments, causing downstream SpeechLLMs to generate inconsistent or erroneous outputs when processing real-world audio.
StableToken solves this through two key innovations:
🗳️ Voting-LFQ: A novel multi-voter quantization mechanism that achieves robust consensus under noise
🔊 Noise-Aware Consensus Training: A multi-branch training paradigm that enhances representational stability by achieving a global consensus between noisy and clean branches
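The idea behind a multi-voter quantizer can be illustrated with a small, dependency-free sketch. It rests on two assumptions that this README does not spell out: that lookup-free quantization (LFQ) binarizes each projected dimension by its sign, and that the voters are combined by a bit-wise majority vote over an odd number of voters. The function names `lfq_bits` and `vote_token` are hypothetical and are not part of the StableToken code.

```python
def lfq_bits(projection):
    """Binarize one voter's projection by sign (assumed LFQ behaviour)."""
    return [1 if x > 0 else 0 for x in projection]


def vote_token(voter_projections):
    """Combine several voters into one token id by bit-wise majority vote.

    Assumption: an odd number of voters, so every bit has a strict majority.
    """
    n = len(voter_projections)
    if n == 0 or n % 2 == 0:
        raise ValueError("expected an odd, non-zero number of voters")
    bit_rows = [lfq_bits(p) for p in voter_projections]
    # For each bit position, keep the value chosen by more than half the voters.
    majority = [1 if 2 * sum(column) > n else 0 for column in zip(*bit_rows)]
    # Read the majority bits as a binary number, most significant bit first.
    token = 0
    for bit in majority:
        token = (token << 1) | bit
    return token


# Three voters, four bits per frame (toy values).
clean_frame = [
    [0.9, -0.8, 0.7, -0.6],
    [0.8, -0.9, 0.6, -0.7],
    [0.7, -0.7, 0.8, -0.5],
]
# Noise pushes one dimension of the first voter across zero.
noisy_frame = [
    [0.9, 0.1, 0.7, -0.6],
    [0.8, -0.9, 0.6, -0.7],
    [0.7, -0.7, 0.8, -0.5],
]
print(vote_token(clean_frame), vote_token(noisy_frame))
```

In this toy case a single voter flips one bit under noise, but the other two voters outvote it, so the clean and noisy frames map to the same token id.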
This results in:
✅ Roughly 2.5× lower UED than the best existing supervised semantic tokenizer (10.17% vs 26.17%)
✅ High-quality speech reconstruction from discrete tokens
✅ Seamless integration with downstream LLMs
✅ Superior downstream performance for SpeechLLMs on noisy audio
[!NOTE]
UED (Unit Edit Distance) measures the edit distance between the token sequences produced from clean and noisy versions of the same audio. Lower UED indicates better noise robustness. StableToken achieves a 60% reduction in UED relative to the best existing supervised semantic tokenizer.
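As a rough illustration of the metric, the sketch below computes a Levenshtein distance between two token sequences and reports it as a percentage. Normalizing by the length of the clean sequence is an assumption made here for illustration; the aggregation used in ued.py may differ. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x with y, or match
            ))
        prev = curr
    return prev[-1]


def unit_edit_distance(clean_tokens, noisy_tokens):
    """UED in percent, normalized by the clean sequence length (assumption)."""
    if not clean_tokens:
        raise ValueError("clean_tokens must not be empty")
    return 100.0 * edit_distance(clean_tokens, noisy_tokens) / len(clean_tokens)


# Toy token ids: one of ten tokens changes under noise.
clean = [5, 5, 9, 2, 7, 7, 1, 3, 8, 4]
noisy = [5, 5, 9, 6, 7, 7, 1, 3, 8, 4]
print(unit_edit_distance(clean, noisy))  # 10.0
```

With the figures quoted above, the relative reduction is (26.17 - 10.17) / 26.17, or about 61%, which is consistent with the stated 60% reduction.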
Run UED Evaluation
Before running the evaluation, you need to prepare a parquet file containing paired clean and noisy audio data. You can use the audiomentations library to add noise to clean audio samples.
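To make concrete what a noisy counterpart of a clean sample looks like, here is a dependency-free sketch that mixes Gaussian noise into a waveform at a chosen signal-to-noise ratio. It is only an illustration: a library such as audiomentations offers ready-made transforms for this, and the function names, the 16 kHz sample rate, and the 10 dB SNR below are assumptions rather than settings taken from ued.py.

```python
import math
import random


def add_noise_at_snr(clean, snr_db, seed=0):
    """Return clean + Gaussian noise, scaled so the mix has exactly snr_db."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in clean) / len(clean)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR of a noisy signal relative to its clean reference, in dB."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
clean_wave = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy_wave = add_noise_at_snr(clean_wave, snr_db=10)
print(round(measured_snr_db(clean_wave, noisy_wave), 3))
```

Each clean sample and its noisy counterpart would then be stored side by side as one row of the parquet file.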
[!TIP]
Data Format: The current code in ued.py expects the parquet file to contain specific columns: 'audio_en_clean', 'audio_en_noise', 'audio_zh_clean', and 'audio_zh_noise'. You can easily modify the column names in the script to match your custom dataset structure.
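If your dataset uses different column names, one alternative to editing the script is to rename the columns before writing the parquet file. The sketch below only checks names; `resolve_columns` and the custom names `clean_en` and `noisy_en` are hypothetical, and the expected names are the four listed in the tip above.

```python
EXPECTED_COLUMNS = (
    "audio_en_clean",
    "audio_en_noise",
    "audio_zh_clean",
    "audio_zh_noise",
)


def resolve_columns(available, rename=None):
    """Apply a rename map to column names and report expected names still missing."""
    rename = rename or {}
    renamed = [rename.get(name, name) for name in available]
    missing = [name for name in EXPECTED_COLUMNS if name not in renamed]
    return renamed, missing


# Hypothetical custom dataset whose English columns use other names.
custom_columns = ["clean_en", "noisy_en", "audio_zh_clean", "audio_zh_noise"]
rename_map = {"clean_en": "audio_en_clean", "noisy_en": "audio_en_noise"}

renamed, missing = resolve_columns(custom_columns, rename_map)
print(renamed, missing)
```

The same mapping can be passed to a dataframe library's rename step before the file is saved.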
We thank the authors of GLM-4-Voice for their open-source code.
📜 Citation
If you find StableToken useful for your research, please cite:
@article{song2025stabletoken,
title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
journal={arXiv preprint arXiv:2509.22220},
year={2025}
}
(ICLR 2026) StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
WeChat AI
🏆 State-of-the-art noise robustness: 60% lower UED than the best existing supervised semantic tokenizer
Stability Comparison: As the noise scale increases from 0% to 100%, StableToken maintains highly stable token sequences (bottom), while the baseline tokenizer (middle) shows significant instability and jitter.
📢 News
💡 Why StableToken?
🚀 Quick Start
Installation
Detailed Installation Guide Using Conda
Clone the repository with submodules:
If you have already cloned without --recursive:
Create a conda environment:
Install dependencies:
Download Model
Using huggingface-cli:
Or using Python:
Run Inference
Command Line Arguments:
--device (str, default: auto): auto, cpu, cuda, cuda:0, etc.
--model_path (str)
--audio_path (list[str])
Example Command:
Example Output:
💻 Usage
Python API
For a complete runnable example, please refer to example_usage.py. Below is a simplified example of using the core components:
Supported Audio Formats
We recommend using WAV format. However, StableToken supports all audio formats compatible with torchaudio (including .flac, .mp3, etc.).
📊 Performance
StableToken achieves state-of-the-art noise robustness while maintaining high reconstruction quality.
Noise Robustness
Speech Reconstruction
Measurements of Word Error Rate (WER, ↓) and Mean Opinion Score (MOS, ↑) on LibriSpeech (LS) and SEED benchmarks.
Columns: Rate; WER on LS-clean, LS-other, SEED-en, SEED-zh; MOS on LS-clean, LS-other, SEED-en, SEED-zh.
🦁 Model Zoo
🙏 Acknowledgements
📄 License
This project is licensed under the License Term of StableToken.