[BugFix][CI] Fix GLM5.1-W8A8 MTP load weight error when vLLM is updated to v0.21.0 (#10317)
What this PR does / why we need it?
After vllm PR #41706, GlmMoeDsaForCausalLM.load_weights uses AutoWeightsLoader, which does not skip rot.weight of the MTP layer and causes a ValueError while loading weights.
Does this PR introduce any user-facing change?
No.
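To illustrate the load failure described under "What this PR does", here is a minimal sketch of filtering a weight stream before it reaches a loader. It is not the actual patch: the helper name `skip_weights`, the `SKIP_SUFFIXES` tuple, and the example layer names are hypothetical, and only the parameter name rot.weight comes from the PR text.

```python
from typing import Any, Iterable, Iterator, Tuple

# Hypothetical list of parameter names to drop; "rot.weight" is the
# name mentioned in this PR, everything else here is illustrative.
SKIP_SUFFIXES = ("rot.weight",)


def skip_weights(
    weights: Iterable[Tuple[str, Any]],
    skip_suffixes: Tuple[str, ...] = SKIP_SUFFIXES,
) -> Iterator[Tuple[str, Any]]:
    """Yield (name, tensor) pairs, dropping names that end in a skipped suffix."""
    for name, tensor in weights:
        # Match on a full dotted component so "carrot.weight" is not dropped.
        if any(name == s or name.endswith("." + s) for s in skip_suffixes):
            continue
        yield name, tensor


if __name__ == "__main__":
    # Placeholder values stand in for tensors; the layer names are made up.
    checkpoint = [
        ("model.layers.0.mlp.weight", 1),
        ("model.layers.78.rot.weight", 2),
    ]
    for name, _ in skip_weights(checkpoint):
        print(name)
```

Filtering the iterator keeps the loader itself unchanged, which is one way to drop a checkpoint entry that has no matching model parameter; the real fix may use a different mechanism.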
How was this patch tested?
The model weights load correctly, and the acceptance rate of the draft model is as expected.
(APIServer pid=774401) INFO 06-11 11:07:59 [loggers.py:271] Engine 000: Avg prompt throughput: 1.0 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=774401) INFO 06-11 11:07:59 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.50, Accepted throughput: 0.94 tokens/s, Drafted throughput: 1.13 tokens/s, Accepted: 50 tokens, Drafted: 60 tokens, Per-position acceptance rate: 1.000, 0.900, 0.600, Avg Draft acceptance rate: 83.3%
(APIServer pid=774401) INFO 06-11 11:08:09 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=774401) INFO 06-11 11:08:09 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 5.20 tokens/s, Drafted throughput: 9.90 tokens/s, Accepted: 52 tokens, Drafted: 99 tokens, Per-position acceptance rate: 0.818, 0.606, 0.152, Avg Draft acceptance rate: 52.5%
(APIServer pid=774401) INFO 06-11 11:08:19 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=774401) INFO 06-11 11:08:19 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.03, Accepted throughput: 6.70 tokens/s, Drafted throughput: 9.90 tokens/s, Accepted: 67 tokens, Drafted: 99 tokens, Per-position acceptance rate: 0.909, 0.697, 0.424, Avg Draft acceptance rate: 67.7%
(APIServer pid=774401) INFO 06-11 11:08:29 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=774401) INFO 06-11 11:08:29 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.79, Accepted throughput: 6.10 tokens/s, Drafted throughput: 10.20 tokens/s, Accepted: 61 tokens, Drafted: 102 tokens, Per-position acceptance rate: 0.824, 0.588, 0.382, Avg Draft acceptance rate: 59.8%
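The SpecDecoding figures in the log above can be cross-checked by hand. The sketch below assumes 3 speculative tokens per draft (inferred from the three per-position rates, not stated in the PR) and assumes mean acceptance length is computed as 1 + accepted / number of drafts, which reproduces the logged values. The function name `spec_decode_summary` is hypothetical.

```python
def spec_decode_summary(accepted: int, drafted: int, num_spec_tokens: int) -> dict:
    """Recompute summary metrics from raw accepted/drafted token counts."""
    # Each draft proposes num_spec_tokens tokens (assumption, see lead-in).
    num_drafts = drafted / num_spec_tokens
    return {
        # Share of drafted tokens that were accepted, in percent.
        "acceptance_rate_pct": 100.0 * accepted / drafted,
        # One target-model token per step plus the accepted draft tokens.
        "mean_acceptance_length": 1.0 + accepted / num_drafts,
    }


if __name__ == "__main__":
    # (accepted, drafted) pairs copied from the four log intervals.
    for accepted, drafted in [(50, 60), (52, 99), (67, 99), (61, 102)]:
        s = spec_decode_summary(accepted, drafted, num_spec_tokens=3)
        print(
            f"rate={s['acceptance_rate_pct']:.1f}% "
            f"length={s['mean_acceptance_length']:.2f}"
        )
```

Under these assumptions the first interval gives 83.3% and 3.50, matching the log; the per-position rates (1.000 + 0.900 + 0.600 = 2.5) also sum to accepted / drafts for that interval.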
- vLLM version: v0.21.0
- vLLM main: https://github.com/vllm-project/vllm/commit/9090368b650896bf5fc990c921df7eb4c20355a5
Signed-off-by: Wangbingjie wangbj1207@126.com
vLLM Ascend Plugin
| About Ascend | Documentation | #sig-ascend | Users Forum | Weekly Meeting |
English | 中文
Latest News 🔥
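Overview
The vLLM Ascend plugin (vllm-ascend) is a community-maintained backend plugin that lets vLLM run seamlessly on Ascend NPUs. It is the recommended way to support the Ascend backend within the vLLM community. It follows the principles described in [RFC]: Hardware pluggable, providing vLLM support for Ascend NPUs in a decoupled way.
With the vLLM Ascend plugin, popular large language models, including Transformer-like, Mixture-of-Experts (MoE), embedding, and multimodal models, can run seamlessly on Ascend NPUs.
For details on supported models, see the supported models list.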
Prerequisites
Getting Started
We recommend the following versions for a quick start:
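Branch Policy
vllm-ascend has a main branch and development branches.
releases/v0.13.0 is the vllm-ascend development branch for vLLM v0.13.0. The maintained branches are listed below:
See the versioning policy for more details.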
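Contributing
See the CONTRIBUTING document for more information on setting up the development environment, functional testing, and PR submission guidelines.
We welcome and value contributions and collaboration of any kind: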
Weekly Meeting
License
Apache License 2.0, as found in the LICENSE file.