vllm-metax-qwen3-mtp261

Competition work log for MetaX-MACA/vLLM-metax issue #261.

Current PR

  • Upstream PR: https://github.com/MetaX-MACA/vLLM-metax/pull/294
  • Status: ready for review, opened from xzh25:fix-qwen3-mtp261-compat
  • Local source commits:
    • 045198c Fix MetaX vLLM 0.23 MTP import compatibility
    • ff67715 Handle missing torch.cuda in MetaX compat hooks
    • 0466e6f Test compat import without torch.cuda
    • 82fccb2 Handle missing MoE disable_inplace config
  • Remote validation worktree: /data/vllm-metax-contribution/src/vLLM-metax-mtp261

Fix Summary

The MTP path in vLLM 0.23 imports quantization and fusion modules before the MetaX plugin has installed its compatibility shims. As a result, startup can fail before generation with a missing _C::scaled_fp4_quant.out or _C::silu_and_mul_per_block_quant operator, or with missing torch.accelerator helpers in spawned EngineCore workers.

The PR adds an early vllm_metax.compat hook, imports it at package import time, keeps the legacy triton_support import path active, and adds a focused regression test for the act_quant_fusion import path.
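
The shim idea described above can be illustrated with a minimal sketch. It is not the code in vllm_metax.compat: the helper name install_missing_attr, the stand-in namespace, and the no-op fallback are all hypothetical, and a plain SimpleNamespace replaces torch.accelerator so the example runs without PyTorch.

```python
import types


def install_missing_attr(namespace, name, fallback):
    """Attach `fallback` as `namespace.<name>` only when it is absent.

    Returns True if the shim was installed, False if the attribute
    already existed and was left untouched.
    """
    if hasattr(namespace, name):
        return False
    setattr(namespace, name, fallback)
    return True


def _noop_empty_cache():
    """Hypothetical fallback for a backend with no cache-release helper."""
    return None


# Stand-in for a torch.accelerator module that lacks empty_cache (assumption).
fake_accelerator = types.SimpleNamespace(is_available=lambda: True)

installed = install_missing_attr(fake_accelerator, "empty_cache", _noop_empty_cache)
installed_again = install_missing_attr(fake_accelerator, "empty_cache", _noop_empty_cache)
```

Because the helper never overwrites an existing attribute, running it at package import time is idempotent, which matters when spawned workers import the package again.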

The 35BA3B GPTQ validation exposed one additional vLLM 0.23 compatibility issue in the MetaX moe_wna16 quantization path: MoeWNA16Method.apply() assumed FusedMoEConfig.disable_inplace always exists. The PR now treats the field as optional and preserves the existing default in-place behavior when it is absent.
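
A minimal sketch of treating the field as optional, assuming that a missing disable_inplace should behave like False. The two dataclasses are hypothetical stand-ins for FusedMoEConfig, and use_inplace is an illustrative helper, not the actual MoeWNA16Method.apply() code.

```python
from dataclasses import dataclass


@dataclass
class ConfigWithoutField:
    """Stand-in for a config that has no disable_inplace (hypothetical)."""
    num_experts: int = 8


@dataclass
class ConfigWithField:
    """Stand-in for a config that does carry disable_inplace (hypothetical)."""
    num_experts: int = 8
    disable_inplace: bool = True


def use_inplace(moe_config) -> bool:
    # A missing field is read as "not disabled", so the in-place
    # default is preserved for configs that lack the attribute.
    return not getattr(moe_config, "disable_inplace", False)
```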

Before Fix

Pre-fix runs failed before generation on the same MetaX validation machine:

  • 079_qwen35_9b_mtp261_smoke.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
  • 084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt: AttributeError: module 'torch.accelerator' has no attribute 'empty_cache'
  • 088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist

Validation

Remote MetaX C500 environment, MACA 3.5.3.20, vLLM 0.23.0:

  • collect_env and mx-smi: saved
  • python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch: 2 passed
  • import vllm_metax.quant_config; import vllm.compilation.passes.fusion.act_quant_fusion: passed
  • python -m py_compile ...: passed
  • git diff --check: passed
  • Qwen3.5-9B MTP smoke: generated normally, draft_tokens=86, accepted_tokens=54
  • Qwen3.5-27B-W8A8 MTP smoke: generated normally, draft_tokens=82, accepted_tokens=54
  • Issue-like 27B W8A8 MTP prompt: generated normally, draft_tokens=90, accepted_tokens=82, accepted_tokens_per_pos=[41, 41]
  • Qwen3.6-35B-A3B GPTQ MTP smoke: target and MTP drafter loaded, generated normally, draft_tokens=62, accepted_tokens=0
  • Qwen3.5-9B OpenAI-compatible server API MTP smoke: /v1/models, /v1/completions, and /metrics passed, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]

Validation matrix using the same prompt and max_tokens=96:

| Case | MTP | Load (s) | Generate (s) | Output tok/s | Repeated bigram ratio | Spec decode |
|---|---|---|---|---|---|---|
| Qwen3.5-9B | no | 90.289 | 3.995 | 24.0311 | 0.0125 | n/a |
| Qwen3.5-9B | yes | 93.406 | 3.352 | 28.6370 | 0.0125 | drafts=45, draft_tokens=90, accepted=50, per_pos=[30, 20] |
| Qwen3.5-27B-W8A8 | no | 99.703 | 9.338 | 10.2801 | 0.0444 | n/a |
| Qwen3.5-27B-W8A8 | yes | 102.532 | 4.434 | 21.6529 | 0.0000 | drafts=35, draft_tokens=70, accepted=60, per_pos=[31, 29] |
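
The matrix figures can be reduced to a few derived ratios. The sketch below uses only numbers from the matrix; the function names are illustrative, and the mean acceptance length formula (one target token plus accepted draft tokens per draft step) is a common speculative-decoding convention assumed here, not something the logs report directly.

```python
def speedup(mtp_tok_s: float, baseline_tok_s: float) -> float:
    """Output-token throughput ratio of the MTP run over the no-MTP run."""
    return mtp_tok_s / baseline_tok_s


def acceptance_rate(accepted: int, draft_tokens: int) -> float:
    """Fraction of proposed draft tokens that the target model accepted."""
    return accepted / draft_tokens if draft_tokens else 0.0


def mean_acceptance_length(accepted: int, drafts: int) -> float:
    """Tokens emitted per draft step, counting the target's own token (assumed definition)."""
    return 1.0 + accepted / drafts if drafts else 1.0


# Values copied from the validation matrix.
rows = {
    "Qwen3.5-9B": dict(mtp=28.6370, base=24.0311, drafts=45, draft_tokens=90, accepted=50),
    "Qwen3.5-27B-W8A8": dict(mtp=21.6529, base=10.2801, drafts=35, draft_tokens=70, accepted=60),
}

summary = {
    name: {
        "speedup": speedup(r["mtp"], r["base"]),
        "acceptance_rate": acceptance_rate(r["accepted"], r["draft_tokens"]),
        "mean_acceptance_length": mean_acceptance_length(r["accepted"], r["drafts"]),
    }
    for name, r in rows.items()
}

for name, stats in summary.items():
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

These are single-run, single-prompt numbers, so the ratios describe this configuration only.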

Long-output MTP stress check:

  • Qwen3.5-27B-W8A8 with MTP, max_tokens=256: generated 256 tokens normally at 23.2837 tok/s
  • Spec decode metrics: drafts=99, draft_tokens=198, accepted=158, accepted_tokens_per_pos=[85, 73]
  • Repetition metrics: bigram repeat ratio 0.0853, trigram 0.0391, 4-gram 0.0157
  • Most repeated 4-gram was Genshin Impact Version 5.0 with count 3, attributable to the requested topic rather than looped output
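
For readers who want to reproduce this kind of repetition check, here is a minimal sketch. It assumes the ratio is defined as one minus the share of distinct n-grams, computed over whitespace-split tokens; the script that produced the numbers above may use a different definition or tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_repeat_ratio(tokens, n):
    """1 - distinct/total n-grams; 0.0 means no n-gram occurs twice (assumed definition)."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def most_repeated(tokens, n):
    """Return (ngram, count) for the most frequent n-gram, or None if there are none."""
    counts = Counter(ngrams(tokens, n))
    return counts.most_common(1)[0] if counts else None


# Toy input, not model output.
sample = "a b a b a b".split()
ratio = ngram_repeat_ratio(sample, 2)
top = most_repeated(sample, 2)
```

Inspecting the most frequent n-gram alongside the ratio helps separate topic-driven repetition from a degenerate loop.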

35BA3B model search and validation:

  • Official Qwen3.6-35B-A3B-FP8: downloaded, but MetaX/vLLM rejected it with the error "fp8 quantization is currently not supported in maca".
  • Eco-Tech Qwen3.6-35B-A3B-w8a8: downloaded, but it is an Ascend/msmodelslim format with no vLLM quantization_config; MetaX treated it as unquantized MoE and OOMed.
  • FlagRelease Qwen3.6-35B-A3B-nomtp-metax-FlagOS: despite "nomtp" in the name, the checkpoint metadata includes MTP weights. After a dual-C500 machine became available, the checkpoint was downloaded and validated with tensor_parallel_size=2.
  • potter001/gptq-Qwen3.6-35B-A3B-4bit-group: downloaded from ModelScope, 7 safetensors shards, 19.27 GiB after Git LFS cache cleanup, quant_method=gptq, mtp_num_hidden_layers=1.

35BA3B FlagRelease dual-C500 TP=2 results:

  • Environment: 2x MetaX C500 64GB, MACA 3.5.3.20, vLLM 0.23.0, CUDA_VISIBLE_DEVICES=0,1, tensor_parallel_size=2, max_model_len=2048, max_num_seqs=1, model tokenizer.
  • Model integrity: 21 safetensors shards, 71903776768 bytes, model_type=qwen3_5_moe, 1045 indexed weights, 19 MTP weights, no missing shards.
  • MTP smoke, max_tokens=32: target model and MTP drafter loaded, LOAD_S 296.123, GEN_S 7.929, OUTPUT_TOKENS_PER_S 4.0358, drafts=13, draft_tokens=26, accepted=20, accepted_tokens_per_pos=[13, 7].
  • No-MTP smoke, max_tokens=32: LOAD_S 194.923, GEN_S 7.104, OUTPUT_TOKENS_PER_S 4.5042.
  • MTP smoke, max_tokens=128: LOAD_S 150.358, GEN_S 6.293, OUTPUT_TOKENS_PER_S 20.3389, drafts=49, draft_tokens=98, accepted=79, accepted_tokens_per_pos=[46, 33].
  • No-MTP smoke, max_tokens=128: LOAD_S 150.576, GEN_S 12.054, OUTPUT_TOKENS_PER_S 10.619.
  • The 128-token TP=2 comparison shows MTP acceleration with a nonzero accepted-token count on the official/FlagRelease 35BA3B checkpoint: 20.3389 / 10.619 = 1.915x output-token throughput for this prompt and run configuration.
  • Peak observed model memory was about 36.0 GiB per C500 both with and without MTP, so dual 64GB cards were sufficient.
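
The model integrity line above can be checked mechanically. The sketch below assumes the usual Hugging Face sharded layout, where model.safetensors.index.json holds a weight_map from tensor name to shard file name. The "mtp." prefix used to count MTP weights is an assumption about this checkpoint's naming, and the toy data is not taken from the real index.

```python
def check_index(weight_map, present_files, mtp_prefix="mtp."):
    """Summarize a sharded checkpoint index against the files on disk.

    weight_map: dict of tensor name -> shard file name (from the index JSON).
    present_files: iterable of shard file names that actually exist.
    mtp_prefix: assumed name prefix of MTP drafter weights (hypothetical).
    """
    required = set(weight_map.values())
    missing = sorted(required - set(present_files))
    mtp_weights = [name for name in weight_map if name.startswith(mtp_prefix)]
    return {
        "indexed_weights": len(weight_map),
        "shards": len(required),
        "mtp_weights": len(mtp_weights),
        "missing_shards": missing,
    }


# Toy index for illustration only.
toy_map = {
    "model.layers.0.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.weight": "model-00002-of-00002.safetensors",
    "mtp.layers.0.weight": "model-00002-of-00002.safetensors",
}
report = check_index(toy_map, ["model-00001-of-00002.safetensors"])
```

In practice weight_map would come from json.load on the index file and present_files from a directory listing of the checkpoint folder.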

35BA3B GPTQ MTP results after moe_wna16 compatibility fix:

  • Smoke check, max_tokens=32: generated normally, LOAD_S 130.03, GEN_S 6.99, drafts=31, draft_tokens=62, accepted=0.
  • Issue-like check, max_tokens=128: generated normally, OUTPUT_TOKENS_PER_S 9.2379, repeat ratios bigram 0.0806, trigram 0.0492, 4-gram 0.0167.
  • Spec metrics for the 128-token check: drafts=127, draft_tokens=254, accepted=0, accepted_tokens_per_pos=[0, 0].
  • One-token MTP check, num_speculative_tokens=1, max_tokens=64: generated normally, OUTPUT_TOKENS_PER_S 7.3182, drafts=63, draft_tokens=63, accepted=0, accepted_tokens_per_pos=[0].
  • The GPTQ 35BA3B output did not show the corrupted repeated-output loop from #261. The accepted-token count is explicitly recorded as zero, so this is functional MTP-path coverage rather than a speedup claim for this checkpoint/tokenizer workaround.

OpenAI-compatible server API validation:

  • Model: /mnt/moark-models/Qwen3.5-9B
  • Command path: vllm serve ... --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
  • /v1/models: returned served model qwen35-9b-mtp
  • /v1/completions: returned 48 completion tokens
  • /metrics: exposed spec-decode counters with drafts=18, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]
  • This closes the server-path validation gap for a known-good MTP model. The 35BA3B GPTQ accepted-token gap remains checkpoint/tokenizer-specific.
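
A minimal sketch of pulling spec-decode counters out of a Prometheus-format /metrics response. The metric names and the position label in SAMPLE are assumptions about how vLLM exposes these counters, and the sample text is hand-written from the counts listed above rather than copied from the log.

```python
import re

# Hand-written sample in Prometheus text format; names and labels are assumed.
SAMPLE = """\
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="qwen35-9b-mtp"} 18.0
vllm:spec_decode_num_draft_tokens_total{model_name="qwen35-9b-mtp"} 36.0
vllm:spec_decode_num_accepted_tokens_total{model_name="qwen35-9b-mtp"} 31.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{model_name="qwen35-9b-mtp",position="0"} 17.0
vllm:spec_decode_num_accepted_tokens_per_pos_total{model_name="qwen35-9b-mtp",position="1"} 14.0
"""

LINE = re.compile(r'^([A-Za-z_:][\w:]*)(?:\{([^}]*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_spec_metrics(text, prefix="vllm:spec_decode_"):
    """Return (totals, per_position) for metrics whose name starts with `prefix`."""
    totals, per_pos = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        name = match.group(1)[len(prefix):]
        labels = dict(LABEL.findall(match.group(2) or ""))
        value = float(match.group(3))
        if "position" in labels:
            pos = int(labels["position"])
            per_pos[pos] = per_pos.get(pos, 0.0) + value
        else:
            totals[name] = totals.get(name, 0.0) + value
    return totals, [per_pos[p] for p in sorted(per_pos)]


totals, per_position = parse_spec_metrics(SAMPLE)
```

Against a live server, the text would come from an HTTP GET on /metrics; the parsing step stays the same if the names match.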

Notes:

  • The available 30B-A3B model configs do not include MTP metadata; they were not used as #261 MTP substitutes.
  • The Qwen3.5-27B-W8A8 tokenizer config uses TokenizersBackend, which this Transformers environment cannot instantiate. The 27B validation therefore uses the local Qwen3.5-9B tokenizer as a smoke-test workaround.
  • The downloaded Qwen3.6-35B-A3B GPTQ tokenizer has the same TokenizersBackend compatibility problem, so the 35BA3B GPTQ runs also use the local Qwen3.5-9B tokenizer workaround.
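
The tokenizer workaround amounts to pointing the engine at a different tokenizer directory than the model directory. The sketch below only builds the keyword arguments. build_llm_kwargs is a hypothetical helper, the 27B model path is a placeholder, and using /mnt/moark-models/Qwen3.5-9B as the tokenizer location is an assumption based on the server validation section. The speculative method string is the one from the vllm serve command above.

```python
def build_llm_kwargs(model_path, tokenizer_path=None, num_speculative_tokens=2):
    """Assemble engine keyword arguments with an optional tokenizer override."""
    kwargs = {
        "model": model_path,
        "speculative_config": {
            "method": "qwen3_next_mtp",
            "num_speculative_tokens": num_speculative_tokens,
        },
    }
    if tokenizer_path is not None:
        # Workaround: load the tokenizer from another checkpoint directory.
        kwargs["tokenizer"] = tokenizer_path
    return kwargs


kwargs = build_llm_kwargs(
    "/path/to/Qwen3.5-27B-W8A8",  # placeholder, not the real location
    tokenizer_path="/mnt/moark-models/Qwen3.5-9B",  # assumed tokenizer directory
)

# With vLLM installed, these would be passed on as:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

Since the borrowed tokenizer may not match the checkpoint's own vocabulary handling exactly, results obtained this way are best read as smoke-test evidence, in line with the notes above.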

Remote logs:

  • /data/vllm-metax-contribution/ops/logs/079_qwen35_9b_mtp261_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt
  • /data/vllm-metax-contribution/ops/logs/088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt
  • /data/vllm-metax-contribution/ops/logs/086_qwen35_9b_mtp261_mtp_only_after_accel_fix.txt
  • /data/vllm-metax-contribution/ops/logs/089_test_top_level_compat_import_order.txt
  • /data/vllm-metax-contribution/ops/logs/090_qwen35_27b_w8a8_mtp261_after_top_level_compat.txt
  • /data/vllm-metax-contribution/ops/logs/091_qwen35_27b_w8a8_mtp261_issue_like_prompt.txt
  • /data/vllm-metax-contribution/ops/logs/092_collect_env_mtp261.txt
  • /data/vllm-metax-contribution/ops/logs/093_mtp261_validation_matrix.txt
  • /data/vllm-metax-contribution/ops/logs/094_qwen35_27b_w8a8_mtp_long_output_check.txt
  • /data/vllm-metax-contribution/ops/logs/095_qwen36_35ba3b_fp8_mtp_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/096_qwen36_35ba3b_w8a8_mtp_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/097_qwen36_35ba3b_w8a8_experts_int8_mtp_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/098_qwen36_35ba3b_w8a8_quark_mtp_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/099_qwen36_35ba3b_w8a8_int8_per_channel_mtp_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/100_qwen36_35ba3b_gptq_mtp_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/101_qwen36_35ba3b_gptq_mtp_9b_tokenizer_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/102_qwen36_35ba3b_gptq_mtp_9b_tokenizer_file_smoke.txt
  • /data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
  • /data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt
  • /data/vllm-metax-contribution/ops/logs/105_qwen35_9b_mtp_openai_server_api.txt
  • /data/vllm-metax-contribution/ops/logs/106_qwen36_35ba3b_gptq_mtp_one_token_check.txt
  • /data/vllm-metax-contribution/ops/logs/200_dual_c500_download_flagrelease_35ba3b.log
  • /data/vllm-metax-contribution/ops/logs/201_cpu_resume_flagrelease_35ba3b_lfs_pull.log
  • /data/vllm-metax-contribution/ops/logs/202_dual_c500_flagrelease_35ba3b_mtp_smoke.log
  • /data/vllm-metax-contribution/ops/logs/203_dual_c500_flagrelease_35ba3b_no_mtp_smoke.log
  • /data/vllm-metax-contribution/ops/logs/204_dual_c500_flagrelease_35ba3b_mtp_128.log
  • /data/vllm-metax-contribution/ops/logs/205_dual_c500_flagrelease_35ba3b_no_mtp_128.log

Next

  • Watch PR #294 maintainer feedback and merge status.
  • If maintainers require a narrower fix, split the compatibility hook into the exact import paths they prefer.
  • GPUApps PR record has been submitted: https://www.gitlink.org.cn/ccf-ai-infra/GPUApps/issues/213
  • After merge, append the merge commit and final merge evidence to GPUApps issue #213.
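
About

An open-source contribution competition repository for vLLM-metax issue #261 (corrupted output from Qwen3.6 MTP speculative decoding). It records the project research, issue reproduction, fix design, test validation, performance comparison, and the PR submission and merge process.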
