WeDLM
Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
WeChat AI, Tencent
🏆 The fastest diffusion language model, capable of outperforming vLLM-optimized autoregressive baselines in wall-clock speed
⬆️ Real-time comparison: Qwen3-8B-Instruct with vLLM (left) vs WeDLM-8B-Instruct (right) on the same prompt
📢 What's New
📊 Added AR baseline evaluation script for benchmarking against autoregressive models (2026-01-02)
🎓 Released fine-tuning framework with support for dense & MagiAttention backends (2025-12-29)
🚀 Initial release: WeDLM-7B/8B models, inference engine, evaluation suite
💡 Why WeDLM?
Most diffusion language models use bidirectional attention, which breaks KV cache compatibility and fails to translate parallel prediction into actual speedups over optimized AR engines like vLLM.
WeDLM solves this by performing parallel mask recovery under standard causal attention, enabling:
✅ Native KV cache compatibility (FlashAttention, PagedAttention, CUDA Graphs)
✅ Direct initialization from pre-trained AR models (Qwen2.5, Qwen3)
✅ Real speedups measured against production-grade vLLM baselines
GSM8K benchmark: WeDLM achieves 3-6× speedup over vLLM while maintaining competitive accuracy.
🚀 Quick Start
Installation
Manual Installation
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
Docker Installation
# Pull the Docker image
docker pull aiweiliu/wedlm:v3
# Run the container with GPU support
docker run -it --gpus all -p 8080:8080 --name wedlm aiweiliu/wedlm:v3 /bin/bash
# Inside the container, run inference directly
python example.py --model tencent/WeDLM-8B-Instruct
# Inside the container, run the web demo
python web_demo.py --model tencent/WeDLM-8B-Instruct
Run Inference
# Run simple generation
python example.py --model tencent/WeDLM-8B-Instruct
Example Output (NVIDIA H20):
Prompt: A store sells apples for $2 each and oranges for $3 each...
Response: To determine the total amount Tom spent...
Therefore, the total amount Tom spent is $22.
==================================================
Generated tokens: 218
Time elapsed: 0.32 seconds
Speed: 689.18 tokens/s ⚡
==================================================
💻 Interactive Demo & API
Web Demo:
python web_demo.py --model tencent/WeDLM-8B-Instruct
⬆️ Interactive web interface for real-time generation
Python API: see example.py for a minimal programmatic example.
📊 Performance
Speed-Quality Tradeoff
WeDLM’s speedup varies by task characteristics. Structured, low-entropy tasks (math, code) see the largest gains.
| Scenario | Speedup vs vLLM | Notes |
|---|---|---|
| Math Reasoning (GSM8K, MATH) | 3-6× | Structured output, high confidence predictions |
| Code Generation | 2-3× | Predictable syntax patterns |
| Sequential/Counting Tasks | Up to 10× | Highly deterministic outputs |
| Open-ended QA | 1.5-2× | Higher entropy limits parallel acceptance |
[!NOTE]
Acceleration comes with a quality-speed tradeoff. Conservative settings preserve accuracy; aggressive settings maximize speed. See our paper for detailed analysis.
Generation Quality
WeDLM preserves and often improves upon its base AR model capabilities.
🏆 Base Models Benchmark

| Benchmark | Qwen2.5-7B | Qwen3-8B | LLaDA-8B | Dream-7B | WeDLM-7B | WeDLM-8B |
|---|---|---|---|---|---|---|
| ARC-C | 89.93 | 92.66 | 81.14 | 88.40 | 90.70 | 92.92 |
| GSM8K | 79.23 | 85.97 | 71.80 | 75.97 | 84.76 | 90.20 |
| MATH | 43.40 | 50.80 | 28.00 | 38.00 | 48.20 | 53.60 |
| HumanEval | 59.14 | 68.90 | 31.71 | 20.12 | 68.90 | 75.00 |
| MMLU | 71.62 | 74.03 | 64.61 | 70.64 | 71.93 | 75.46 |
| Average | 67.21 | 72.61 | 55.44 | 56.91 | 70.84 | 74.72 |
🎓 Instruct Models Benchmark

| Benchmark | Qwen2.5-7B-Inst | Qwen3-8B-Inst | SDAR-8B-Inst | WeDLM-7B-Inst | WeDLM-8B-Inst |
|---|---|---|---|---|---|
| ARC-C | 86.09 | 91.47 | 91.13 | 89.59 | 92.92 |
| GSM8K | 89.91 | 89.91 | 91.66 | 87.57 | 92.27 |
| MATH | 45.00 | 69.60 | 43.40 | 55.40 | 64.80 |
| HumanEval | 76.22 | 71.95 | 76.83 | 75.00 | 80.49 |
| MMLU | 71.98 | 71.52 | 73.61 | 70.52 | 75.14 |
| Average | 71.09 | 75.12 | 74.22 | 72.78 | 77.53 |
🦁 Model Zoo
| Model | Base | Context | Download |
|---|---|---|---|
| WeDLM-7B | Qwen2.5-7B | 32k | |
| WeDLM-7B-Instruct | Qwen2.5-7B | 32k | |
| WeDLM-8B | Qwen3-8B | 32k | |
| WeDLM-8B-Instruct ⭐ | Qwen3-8B | 32k | |
🛠️ Advanced Usage
Installation from Source
Requirements: Python 3.9+, PyTorch 2.1+, CUDA 11.8+
git clone https://github.com/tencent/WeDLM.git
cd WeDLM && bash install.sh
[!NOTE]
flash-attn requires compilation and must be installed after PyTorch.
The install.sh script handles this automatically (default: CUDA 12.9).
For other CUDA versions: CUDA_VERSION=cu124 bash install.sh
Evaluation
Reproduce our results using the provided scripts. To benchmark against autoregressive (AR) baseline models, use the AR baseline evaluation script with vLLM. See evaluation/demo.sh for all benchmark commands.
Fine-tuning
Fine-tune WeDLM on your own data using our training framework.
Attention Backends:
dense — PyTorch native SDPA (works out-of-the-box); a minimal standalone SDPA call is sketched below
magi — MagiAttention for optimized training (requires separate installation)
See finetune/README.md for detailed configuration and data format.
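For context on the dense backend, the snippet below is a bare call to PyTorch's native scaled_dot_product_attention with a causal mask. It is only a sketch of the underlying primitive; the tensor shapes and arguments are generic PyTorch, not taken from WeDLM's training code.

```python
import torch
import torch.nn.functional as F

# Minimal causal scaled-dot-product-attention call, the primitive behind the
# "dense" backend. Shapes follow PyTorch's (batch, heads, seq_len, head_dim).
batch, heads, seq_len, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# is_causal=True applies the standard lower-triangular mask, i.e. the same
# causal attention pattern WeDLM keeps for KV-cache compatibility.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```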
HuggingFace Compatibility
For training or simple forward passes, we provide a standard HF interface.
[!WARNING]
For fast inference, use the wedlm engine (shown in Quick Start). The HF interface below is for training/forward pass convenience only.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tencent/WeDLM-8B-Instruct", trust_remote_code=True)

# Simple forward pass (no generation) through the standard HF interface
inputs = tokenizer("Hello", return_tensors="pt")
out = model(**inputs)
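As a slightly fuller sketch of this forward-pass usage, the variant below adds generic Hugging Face loading options (torch_dtype, device_map) and inspects the returned logits. These kwargs and the .logits convention are standard HF assumptions rather than anything WeDLM-specific; consult the model card if the remote-code class behaves differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/WeDLM-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generic HF loading options (not WeDLM-specific): bf16 weights and automatic
# device placement make forward passes with an 8B model practical.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs)

# Assuming the remote-code class follows the usual causal-LM output convention,
# out.logits has shape (batch, sequence_length, vocab_size).
print(out.logits.shape)
```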
🧠 How It Works
WeDLM introduces Topological Reordering to perform parallel mask recovery under standard causal attention, combined with Streaming Parallel Decoding for continuous prefix commitment.
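For intuition only, here is a toy sketch of that decoding loop: score every still-masked position in one parallel step, commit the longest prefix whose confidence clears a threshold, and repeat. The propose function, confidence scores, and threshold below are hypothetical placeholders, not WeDLM's actual algorithm or code; see the paper for the real procedure.

```python
# Toy illustration of the streaming parallel decoding idea (NOT WeDLM's code).
# One "model call" scores every still-masked position; the decoder then commits
# the longest prefix of positions whose confidence clears a threshold, so the
# committed context only ever grows.
from typing import List, Optional, Tuple

MASK = None  # placeholder for a not-yet-decoded position


def propose(seq: List[Optional[str]], masked: List[int]) -> List[Tuple[int, str, float]]:
    """Hypothetical stand-in for one parallel forward pass: returns
    (position, token, confidence) for every masked position."""
    # Fabricated confidences purely for illustration: positions closer to the
    # committed prefix are scored as more certain.
    return [(pos, f"tok{pos}", 1.0 - 0.1 * i) for i, pos in enumerate(masked)]


def streaming_parallel_decode(length: int, threshold: float = 0.75) -> List[str]:
    seq: List[Optional[str]] = [MASK] * length
    committed = 0  # everything before this index is final
    while committed < length:
        masked = [i for i in range(committed, length) if seq[i] is MASK]
        proposals = propose(seq, masked)  # one parallel step over all masked slots
        accepted = 0
        for pos, token, conf in proposals:  # left-to-right over masked positions
            if conf < threshold:            # first uncertain position ends the prefix
                break
            seq[pos] = token
            committed = pos + 1
            accepted += 1
        if accepted == 0:                   # always accept at least one token
            pos, token, _ = proposals[0]    # so the loop makes progress
            seq[pos] = token
            committed = pos + 1
    return [t for t in seq if t is not None]


print(streaming_parallel_decode(8))  # fills 8 positions in 3 parallel steps
```

Because only a left-to-right prefix is ever committed, the decoded context never changes retroactively, which is what lets WeDLM keep a standard causal KV cache.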
👉 Project Page — Interactive explanations and visualizations
👉 Paper — Technical details and full experimental results
📜 Citation
If you find WeDLM useful for your research, please cite:
@article{liu2025wedlm,
  title={WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference},
  author={Liu, Aiwei and He, Minghua and Zeng, Shaoxun and Zhang, Linhao and Wu, Chuhan and Jia, Wei and Liu, Yuan and Yu, Yang and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2512.22737},
  year={2025}
}
Acknowledgement
Built upon nano-vllm, Qwen, and vLLM.
License
Apache 2.0