vllm-metax-qwen3-mtp261

Competition work log for MetaX-MACA/vLLM-metax issue #261.

Current PR

Upstream PR: https://github.com/MetaX-MACA/vLLM-metax/pull/294
Status: ready for review, opened from xzh25:fix-qwen3-mtp261-compat
Local source commits:
- 045198c Fix MetaX vLLM 0.23 MTP import compatibility
- ff67715 Handle missing torch.cuda in MetaX compat hooks
- 0466e6f Test compat import without torch.cuda
- 82fccb2 Handle missing MoE disable_inplace config
Remote validation worktree: /data/vllm-metax-contribution/src/vLLM-metax-mtp261

Fix Summary

The MTP path in vLLM 0.23 imports quantization and fusion modules before the MetaX plugin has installed its compatibility shims. As a result, startup can fail before any generation, either because the custom ops _C::scaled_fp4_quant.out or _C::silu_and_mul_per_block_quant do not exist, or because torch.accelerator helpers are missing in spawned EngineCore workers.

The PR adds an early vllm_metax.compat hook, imports it at package import time, keeps the legacy triton_support import path active, and adds a focused regression test for the act_quant_fusion import path.
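
As a minimal sketch of the early-shim idea (not the PR's actual code), the helper below fills in a missing attribute on a module-like namespace only when it is absent, so it is safe to run repeatedly at package import time and again in spawned workers. The name install_missing and the stand-in namespace are hypothetical; a real hook would target torch.accelerator and pick a backend-appropriate fallback.

```python
from types import SimpleNamespace


def install_missing(namespace, name, fallback):
    """Attach `fallback` as `namespace.<name>` only if it is not already defined.

    Returns True when the shim was installed, False when the attribute existed.
    """
    if hasattr(namespace, name):
        return False
    setattr(namespace, name, fallback)
    return True


# Hypothetical stand-in for a module such as torch.accelerator that has
# some helpers but lacks empty_cache.
fake_accelerator = SimpleNamespace(synchronize=lambda: None)

installed = install_missing(fake_accelerator, "empty_cache", lambda: None)
installed_again = install_missing(fake_accelerator, "empty_cache", lambda: None)
print(installed, installed_again)
```

Because the helper never overwrites an existing attribute, a later framework version that ships the real helper would take precedence over the shim.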

The 35BA3B GPTQ validation exposed one additional vLLM 0.23 compatibility issue in the MetaX moe_wna16 quantization path: MoeWNA16Method.apply() assumed FusedMoEConfig.disable_inplace always exists. The PR now treats the field as optional and preserves the existing default in-place behavior when it is absent.
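
The optional-field handling can be illustrated with a small sketch. The two dataclasses are hypothetical stand-ins for config objects with and without the field, and use_inplace is an illustrative helper, not the code in MoeWNA16Method.apply().

```python
from dataclasses import dataclass


@dataclass
class OldMoEConfig:
    """Stand-in for a config object that predates the disable_inplace field."""
    num_experts: int = 8


@dataclass
class NewMoEConfig:
    """Stand-in for a config object that defines disable_inplace."""
    num_experts: int = 8
    disable_inplace: bool = True


def use_inplace(config) -> bool:
    # A missing field is treated as "not disabled", which keeps the
    # in-place default when the attribute is absent.
    return not getattr(config, "disable_inplace", False)


print(use_inplace(OldMoEConfig()), use_inplace(NewMoEConfig()))
```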

Before Fix

Pre-fix runs failed before generation on the same MetaX validation machine:

079_qwen35_9b_mtp261_smoke.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt: AttributeError: module 'torch.accelerator' has no attribute 'empty_cache'
088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist

Validation

Remote MetaX C500 environment, MACA 3.5.3.20, vLLM 0.23.0:

collect_env and mx-smi: saved
python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch: 2 passed
import vllm_metax.quant_config; import vllm.compilation.passes.fusion.act_quant_fusion: passed
python -m py_compile ...: passed
git diff --check: passed
Qwen3.5-9B MTP smoke: generated normally, draft_tokens=86, accepted_tokens=54
Qwen3.5-27B-W8A8 MTP smoke: generated normally, draft_tokens=82, accepted_tokens=54
Issue-like 27B W8A8 MTP prompt: generated normally, draft_tokens=90, accepted_tokens=82, accepted_tokens_per_pos=[41, 41]
Qwen3.6-35B-A3B GPTQ MTP smoke: target and MTP drafter loaded, generated normally, draft_tokens=62, accepted_tokens=0
Qwen3.5-9B OpenAI-compatible server API MTP smoke: /v1/models, /v1/completions, and /metrics passed, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]
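
The import-order check in the list above depends on a fresh interpreter, since modules already imported in the current process would hide ordering problems. A minimal sketch of such a check follows; imports_cleanly is a hypothetical helper and not part of the PR's test suite.

```python
import subprocess
import sys


def imports_cleanly(modules, timeout=600):
    """Import `modules` in order in a fresh interpreter.

    Returns (ok, stderr) where ok is True if every import succeeded.
    """
    code = "; ".join(f"import {name}" for name in modules)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stderr
```

On the validation machine the call would look like imports_cleanly(["vllm_metax.quant_config", "vllm.compilation.passes.fusion.act_quant_fusion"]), with a nonempty stderr pointing at the first failing import.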

Validation matrix using the same prompt and max_tokens=96:

Case	MTP	Load s	Generate s	Output tok/s	Repeated bigram ratio	Spec decode
Qwen3.5-9B	no	90.289	3.995	24.0311	0.0125	n/a
Qwen3.5-9B	yes	93.406	3.352	28.6370	0.0125	drafts=45, draft_tokens=90, accepted=50, per_pos=[30, 20]
Qwen3.5-27B-W8A8	no	99.703	9.338	10.2801	0.0444	n/a
Qwen3.5-27B-W8A8	yes	102.532	4.434	21.6529	0.0000	drafts=35, draft_tokens=70, accepted=60, per_pos=[31, 29]
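
To compare rows, the spec-decode counters can be reduced to rates. The sketch below assumes that per_pos[i] counts drafts whose token at position i was accepted, so dividing by the number of drafts gives a per-position acceptance rate; spec_decode_summary is a hypothetical helper name.

```python
def spec_decode_summary(drafts, draft_tokens, accepted, per_pos):
    """Reduce raw spec-decode counters to acceptance rates."""
    if drafts == 0 or draft_tokens == 0:
        return {"acceptance_rate": 0.0, "accepted_per_draft": 0.0, "per_pos_rate": []}
    return {
        # Share of all drafted tokens that the target model accepted.
        "acceptance_rate": accepted / draft_tokens,
        # Average number of accepted draft tokens per drafting step.
        "accepted_per_draft": accepted / drafts,
        # Acceptance rate at each speculative position (assumed definition).
        "per_pos_rate": [count / drafts for count in per_pos],
    }


# Counters from the Qwen3.5-9B MTP row of the matrix above.
summary_9b = spec_decode_summary(45, 90, 50, [30, 20])
# Counters from the Qwen3.5-27B-W8A8 MTP row.
summary_27b = spec_decode_summary(35, 70, 60, [31, 29])
print(summary_9b)
print(summary_27b)
```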

Long-output MTP stress check:

Qwen3.5-27B-W8A8 with MTP, max_tokens=256: generated 256 tokens normally at 23.2837 tok/s
Spec decode metrics: drafts=99, draft_tokens=198, accepted=158, accepted_tokens_per_pos=[85, 73]
Repetition metrics: bigram repeat ratio 0.0853, trigram 0.0391, 4-gram 0.0157
Most repeated 4-gram was Genshin Impact Version 5.0 with count 3, attributable to the requested topic rather than looped output
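
The log does not state how the repeat ratios were computed. One common definition, assumed here, is the share of n-gram occurrences that duplicate an earlier n-gram in the same output; the sketch below implements that definition and is not necessarily the script used for these runs.

```python
from collections import Counter


def _ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_repeat_ratio(tokens, n):
    """Fraction of n-gram occurrences that repeat an earlier n-gram."""
    grams = _ngrams(tokens, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def most_common_ngram(tokens, n):
    """Return (ngram, count) for the most frequent n-gram, or None if empty."""
    grams = _ngrams(tokens, n)
    if not grams:
        return None
    return Counter(grams).most_common(1)[0]


sample = "a b a b c".split()
print(ngram_repeat_ratio(sample, 2), most_common_ngram(sample, 2))
```

A looped output would push these ratios toward 1, while topic-driven repetition of a short phrase keeps them low, which is the distinction drawn above.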

35BA3B model search and validation:

Official Qwen3.6-35B-A3B-FP8: downloaded, but MetaX/vLLM rejected it with the error "fp8 quantization is currently not supported in maca".
Eco-Tech Qwen3.6-35B-A3B-w8a8: downloaded, but it is an Ascend/msmodelslim format with no vLLM quantization_config; MetaX treated it as unquantized MoE and OOMed.
FlagRelease Qwen3.6-35B-A3B-nomtp-metax-FlagOS: metadata includes MTP weights. After a dual-C500 machine became available, the checkpoint was downloaded and validated with tensor_parallel_size=2.
potter001/gptq-Qwen3.6-35B-A3B-4bit-group: downloaded from ModelScope, 7 safetensors shards, 19.27 GiB after Git LFS cache cleanup, quant_method=gptq, mtp_num_hidden_layers=1.

35BA3B FlagRelease dual-C500 TP=2 results:

Environment: 2x MetaX C500 64GB, MACA 3.5.3.20, vLLM 0.23.0, CUDA_VISIBLE_DEVICES=0,1, tensor_parallel_size=2, max_model_len=2048, max_num_seqs=1, using the model's own tokenizer.
Model integrity: 21 safetensors shards, 71903776768 bytes, model_type=qwen3_5_moe, 1045 indexed weights, 19 MTP weights, no missing shards.
MTP smoke, max_tokens=32: target model and MTP drafter loaded, LOAD_S 296.123, GEN_S 7.929, OUTPUT_TOKENS_PER_S 4.0358, drafts=13, draft_tokens=26, accepted=20, accepted_tokens_per_pos=[13, 7].
No-MTP smoke, max_tokens=32: LOAD_S 194.923, GEN_S 7.104, OUTPUT_TOKENS_PER_S 4.5042.
MTP smoke, max_tokens=128: LOAD_S 150.358, GEN_S 6.293, OUTPUT_TOKENS_PER_S 20.3389, drafts=49, draft_tokens=98, accepted=79, accepted_tokens_per_pos=[46, 33].
No-MTP smoke, max_tokens=128: LOAD_S 150.576, GEN_S 12.054, OUTPUT_TOKENS_PER_S 10.619.
The 128-token TP=2 comparison shows an MTP speedup with a nonzero accepted-token count on the official/FlagRelease 35BA3B checkpoint: 20.3389 / 10.619 = 1.915x output-token throughput for this prompt and run configuration.
Peak observed model memory was about 36.0 GiB per C500 both with and without MTP, so the dual 64GB cards were sufficient.
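
The throughput comparison is a plain ratio of the logged OUTPUT_TOKENS_PER_S values. The short sketch below reproduces it for both token budgets; the helper name speedup is illustrative.

```python
def speedup(mtp_tok_s, baseline_tok_s):
    """Ratio of MTP output throughput to no-MTP output throughput."""
    return mtp_tok_s / baseline_tok_s


# Logged OUTPUT_TOKENS_PER_S values from the TP=2 runs above.
ratio_128 = speedup(20.3389, 10.619)
ratio_32 = speedup(4.0358, 4.5042)
print(f"max_tokens=128: {ratio_128:.3f}x")
print(f"max_tokens=32:  {ratio_32:.3f}x")
```

With the logged values, the 32-token pair gives a ratio below 1, so the 1.915x figure applies to the 128-token configuration only.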

35BA3B GPTQ MTP results after moe_wna16 compatibility fix:

Smoke check, max_tokens=32: generated normally, LOAD_S 130.03, GEN_S 6.99, drafts=31, draft_tokens=62, accepted=0.
Issue-like check, max_tokens=128: generated normally, OUTPUT_TOKENS_PER_S 9.2379, repeat ratios bigram 0.0806, trigram 0.0492, 4-gram 0.0167.
Spec metrics for the 128-token check: drafts=127, draft_tokens=254, accepted=0, accepted_tokens_per_pos=[0, 0].
One-token MTP check, num_speculative_tokens=1, max_tokens=64: generated normally, OUTPUT_TOKENS_PER_S 7.3182, drafts=63, draft_tokens=63, accepted=0, accepted_tokens_per_pos=[0].
The GPTQ 35BA3B output did not show the corrupted repeated-output loop from #261. The accepted-token count is explicitly recorded as zero, so these runs provide functional coverage of the MTP path, not a speedup claim, for this checkpoint and tokenizer workaround.

OpenAI-compatible server API validation:

Model: /mnt/moark-models/Qwen3.5-9B
Command path: vllm serve ... --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
/v1/models: returned served model qwen35-9b-mtp
/v1/completions: returned 48 completion tokens
/metrics: exposed spec-decode counters with drafts=18, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]
This closes the server-path validation gap for a known-good MTP model. The 35BA3B GPTQ accepted-token gap remains checkpoint/tokenizer-specific.
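
A sketch of how the spec-decode counters might be pulled out of a /metrics response follows. It filters Prometheus text lines by the substring spec_decode instead of relying on exact metric names; the metric names in the sample text are assumptions for illustration, and only the counter values come from the run above.

```python
def spec_decode_samples(metrics_text):
    """Return {sample_name_with_labels: value} for spec-decode metric lines."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and unrelated metrics.
        if not line or line.startswith("#") or "spec_decode" not in line:
            continue
        # The value is the last whitespace-separated field on the line.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Illustrative payload; the metric names here are assumed, not copied from the log.
SAMPLE = """\
# TYPE vllm:spec_decode_num_drafts_total counter
vllm:spec_decode_num_drafts_total{model_name="qwen35-9b-mtp"} 18.0
vllm:spec_decode_num_draft_tokens_total{model_name="qwen35-9b-mtp"} 36.0
vllm:spec_decode_num_accepted_tokens_total{model_name="qwen35-9b-mtp"} 31.0
vllm:num_requests_running{model_name="qwen35-9b-mtp"} 0.0
"""

parsed = spec_decode_samples(SAMPLE)
print(parsed)
```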

Notes:

The available 30B-A3B model configs do not include MTP metadata; they were not used as #261 MTP substitutes.
The Qwen3.5-27B-W8A8 tokenizer config uses TokenizersBackend, which this Transformers environment cannot instantiate. The 27B validation therefore uses the local Qwen3.5-9B tokenizer as a smoke-test workaround.
The downloaded Qwen3.6-35B-A3B GPTQ tokenizer has the same TokenizersBackend compatibility problem, so the 35BA3B GPTQ runs also use the local Qwen3.5-9B tokenizer workaround.
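
The tokenizer workaround amounts to passing a separate tokenizer path alongside the model path. The sketch below builds a vllm serve argument list with that override; build_serve_args is a hypothetical helper, the model path is a placeholder, and reusing the qwen3_next_mtp method string for other checkpoints is an assumption.

```python
import json


def build_serve_args(model, tokenizer=None, num_speculative_tokens=2):
    """Build an argv list for `vllm serve` with an MTP speculative config."""
    spec = {
        "method": "qwen3_next_mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    args = [
        "vllm", "serve", model,
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
    ]
    if tokenizer is not None:
        # Override the tokenizer when the checkpoint's own one cannot be loaded.
        args += ["--tokenizer", tokenizer]
    return args


serve_args = build_serve_args(
    "/path/to/model",  # placeholder, not a path from the log
    tokenizer="/mnt/moark-models/Qwen3.5-9B",
)
print(" ".join(serve_args))
```

Since the substitute tokenizer may not match the checkpoint's vocabulary exactly, results obtained this way are best read as smoke tests, as noted above.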

Remote logs:

/data/vllm-metax-contribution/ops/logs/079_qwen35_9b_mtp261_smoke.txt
/data/vllm-metax-contribution/ops/logs/084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt
/data/vllm-metax-contribution/ops/logs/088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt
/data/vllm-metax-contribution/ops/logs/086_qwen35_9b_mtp261_mtp_only_after_accel_fix.txt
/data/vllm-metax-contribution/ops/logs/089_test_top_level_compat_import_order.txt
/data/vllm-metax-contribution/ops/logs/090_qwen35_27b_w8a8_mtp261_after_top_level_compat.txt
/data/vllm-metax-contribution/ops/logs/091_qwen35_27b_w8a8_mtp261_issue_like_prompt.txt
/data/vllm-metax-contribution/ops/logs/092_collect_env_mtp261.txt
/data/vllm-metax-contribution/ops/logs/093_mtp261_validation_matrix.txt
/data/vllm-metax-contribution/ops/logs/094_qwen35_27b_w8a8_mtp_long_output_check.txt
/data/vllm-metax-contribution/ops/logs/095_qwen36_35ba3b_fp8_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/096_qwen36_35ba3b_w8a8_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/097_qwen36_35ba3b_w8a8_experts_int8_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/098_qwen36_35ba3b_w8a8_quark_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/099_qwen36_35ba3b_w8a8_int8_per_channel_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/100_qwen36_35ba3b_gptq_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/101_qwen36_35ba3b_gptq_mtp_9b_tokenizer_smoke.txt
/data/vllm-metax-contribution/ops/logs/102_qwen36_35ba3b_gptq_mtp_9b_tokenizer_file_smoke.txt
/data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
/data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt
/data/vllm-metax-contribution/ops/logs/105_qwen35_9b_mtp_openai_server_api.txt
/data/vllm-metax-contribution/ops/logs/106_qwen36_35ba3b_gptq_mtp_one_token_check.txt
/data/vllm-metax-contribution/ops/logs/200_dual_c500_download_flagrelease_35ba3b.log
/data/vllm-metax-contribution/ops/logs/201_cpu_resume_flagrelease_35ba3b_lfs_pull.log
/data/vllm-metax-contribution/ops/logs/202_dual_c500_flagrelease_35ba3b_mtp_smoke.log
/data/vllm-metax-contribution/ops/logs/203_dual_c500_flagrelease_35ba3b_no_mtp_smoke.log
/data/vllm-metax-contribution/ops/logs/204_dual_c500_flagrelease_35ba3b_mtp_128.log
/data/vllm-metax-contribution/ops/logs/205_dual_c500_flagrelease_35ba3b_no_mtp_128.log

Next

Watch PR #294 maintainer feedback and merge status.
If maintainers require a narrower fix, split the compatibility hook into the exact import paths they prefer.
GPUApps PR record has been submitted: https://www.gitlink.org.cn/ccf-ai-infra/GPUApps/issues/213
After merge, append the merge commit and final merge evidence to GPUApps issue #213.