The MTP path in vLLM 0.23 imports quantization and fusion modules before the
MetaX plugin has installed its compatibility shims. Startup can therefore fail
before generation because _C::scaled_fp4_quant.out or
_C::silu_and_mul_per_block_quant does not exist, or because torch.accelerator
helpers are missing in spawned EngineCore workers.
The PR adds an early vllm_metax.compat hook, imports it at package import
time, keeps the legacy triton_support import path active, and adds a focused
regression test for the act_quant_fusion import path.
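The ordering requirement can be illustrated with a minimal sketch. This is not the code in vllm_metax.compat: install_missing_helpers and the stand-in namespace are hypothetical names, used only to show an idempotent shim that fills in absent helpers (such as an empty_cache function) without overwriting helpers that already exist.

```python
from types import SimpleNamespace


def install_missing_helpers(namespace, fallbacks):
    """Attach each fallback only if the namespace lacks that attribute.

    Returns the names that were actually installed, so a second call is a
    no-op and existing helpers are never overwritten.
    """
    installed = []
    for name, func in fallbacks.items():
        if not hasattr(namespace, name):
            setattr(namespace, name, func)
            installed.append(name)
    return installed


# Stand-in for a module such as torch.accelerator that lacks empty_cache.
fake_accelerator = SimpleNamespace(synchronize=lambda: "native")
fallbacks = {
    "empty_cache": lambda: None,        # hypothetical no-op fallback
    "synchronize": lambda: "fallback",  # must not replace the native helper
}

first = install_missing_helpers(fake_accelerator, fallbacks)
second = install_missing_helpers(fake_accelerator, fallbacks)
```

For a shim of this kind to help, it would have to run at package import time, before any module that looks up the helper is imported.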
The 35BA3B GPTQ validation exposed one additional vLLM 0.23 compatibility
issue in the MetaX moe_wna16 quantization path: MoeWNA16Method.apply()
assumed FusedMoEConfig.disable_inplace always exists. The PR now treats the
field as optional and preserves the existing default in-place behavior when it
is absent.
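The optional-field handling can be sketched as follows. The two config classes are stand-ins for FusedMoEConfig with and without the field, and resolve_inplace is a hypothetical helper; the actual change in MoeWNA16Method.apply() may be written differently.

```python
from dataclasses import dataclass


@dataclass
class ConfigWithField:
    # Stand-in for a config version that defines the field.
    disable_inplace: bool = True


@dataclass
class ConfigWithoutField:
    # Stand-in for a config version that does not define the field.
    pass


def resolve_inplace(moe_config):
    # A missing field is treated as "not disabled", so in-place stays on.
    return not getattr(moe_config, "disable_inplace", False)


inplace_when_absent = resolve_inplace(ConfigWithoutField())
inplace_when_disabled = resolve_inplace(ConfigWithField(disable_inplace=True))
```

Using getattr with a default keeps the previous in-place behavior for configs that lack the attribute, while still honoring the flag where it exists.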
Before Fix
Pre-fix runs failed before generation on the same MetaX validation machine:
079_qwen35_9b_mtp261_smoke.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt: AttributeError: module 'torch.accelerator' has no attribute 'empty_cache'
088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
Qwen3.5-9B OpenAI-compatible server API MTP smoke: /v1/models, /v1/completions, and /metrics passed, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]
Validation matrix using the same prompt and max_tokens=96:
Repetition metrics: bigram repeat ratio 0.0853, trigram 0.0391, 4-gram 0.0157
Most repeated 4-gram was Genshin Impact Version 5.0 with count 3, attributable to the requested topic rather than looped output
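The log does not state how the repeat ratios were defined. A minimal sketch of one plausible definition, assuming the ratio is the fraction of n-gram occurrences that repeat an earlier n-gram (1 minus unique over total) and assuming whitespace tokenization; both function names are hypothetical:

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def repeat_ratio(tokens, n):
    """Fraction of n-gram occurrences that duplicate an earlier n-gram."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


tokens = "a b a b a b".split()
bigram_ratio = repeat_ratio(tokens, 2)
top_bigram, top_count = ngram_counts(tokens, 2).most_common(1)[0]
```

Looped output would push these ratios toward 1, whereas a single frequent phrase tied to the topic raises the top n-gram count without raising the ratio much.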
35BA3B model search and validation:
Official Qwen3.6-35B-A3B-FP8: downloaded, but MetaX/vLLM rejected it with the error "fp8 quantization is currently not supported in maca."
Eco-Tech Qwen3.6-35B-A3B-w8a8: downloaded, but it is an Ascend/msmodelslim format with no vLLM quantization_config; MetaX treated it as unquantized MoE and OOMed.
FlagRelease Qwen3.6-35B-A3B-nomtp-metax-FlagOS: metadata includes MTP weights. After a dual-C500 machine became available, the checkpoint was downloaded and validated with tensor_parallel_size=2.
potter001/gptq-Qwen3.6-35B-A3B-4bit-group: downloaded from ModelScope, 7 safetensors shards, 19.27 GiB after Git LFS cache cleanup, quant_method=gptq, mtp_num_hidden_layers=1.
The 128-token TP=2 comparison shows accepted-token MTP acceleration on the official/FlagRelease 35BA3B checkpoint: 20.3389 / 10.619 = 1.915x output-token throughput for this prompt and run configuration.
Peak observed model memory was about 36.0 GiB per C500 both with and without MTP, so the dual 64 GB cards were sufficient.
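The speedup figure is a plain ratio of the two reported output-token throughputs. A short sketch that reproduces the arithmetic from the logged values (the function name is hypothetical):

```python
def throughput_speedup(mtp_tokens_per_s, baseline_tokens_per_s):
    """Ratio of MTP throughput to non-MTP throughput for the same run setup."""
    return mtp_tokens_per_s / baseline_tokens_per_s


# Reported OUTPUT_TOKENS_PER_S values for the 128-token TP=2 runs.
ratio = throughput_speedup(20.3389, 10.619)
rounded_ratio = round(ratio, 3)
```

The ratio describes one prompt and one run configuration, so it should not be read as a general speedup figure.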
35BA3B GPTQ MTP results after moe_wna16 compatibility fix:
The GPTQ 35BA3B output did not show the corrupted repeated-output loop from #261. The accepted-token count is explicitly recorded as zero, so this is functional MTP-path coverage rather than a speedup claim for this checkpoint/tokenizer workaround.
/metrics: exposed spec-decode counters with drafts=18, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]
This closes the server-path validation gap for a known-good MTP model. The 35BA3B GPTQ accepted-token gap remains checkpoint/tokenizer-specific.
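The spec-decode counters can be turned into acceptance rates with simple arithmetic. The sketch below uses the server-path counters quoted above and assumes that every draft proposes one token at each position, which matches draft_tokens being twice drafts here; the function name is hypothetical.

```python
def acceptance_summary(drafts, draft_tokens, accepted_tokens, accepted_per_pos):
    """Derive acceptance rates from raw spec-decode counters."""
    return {
        "acceptance_rate": accepted_tokens / draft_tokens,
        "mean_accepted_per_draft": accepted_tokens / drafts,
        # Assumes each draft contributes exactly one token per position.
        "per_position_rate": [a / drafts for a in accepted_per_pos],
    }


summary = acceptance_summary(
    drafts=18, draft_tokens=36, accepted_tokens=31, accepted_per_pos=[17, 14]
)
```

The same calculation applied to the GPTQ runs gives an acceptance rate of zero, which is why those runs count as functional coverage only.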
Notes:
The available 30B-A3B model configs do not include MTP metadata; they were not used as #261 MTP substitutes.
The Qwen3.5-27B-W8A8 tokenizer config uses TokenizersBackend, which this Transformers environment cannot instantiate. The 27B validation therefore uses the local Qwen3.5-9B tokenizer as a smoke-test workaround.
The downloaded Qwen3.6-35B-A3B GPTQ tokenizer has the same TokenizersBackend compatibility problem, so the 35BA3B GPTQ runs also use the local Qwen3.5-9B tokenizer workaround.
vllm-metax-qwen3-mtp261
Competition work log for MetaX-MACA/vLLM-metax issue #261.
Current PR
xzh25:fix-qwen3-mtp261-compat
045198c Fix MetaX vLLM 0.23 MTP import compatibility
ff67715 Handle missing torch.cuda in MetaX compat hooks
0466e6f Test compat import without torch.cuda
82fccb2 Handle missing MoE disable_inplace config
/data/vllm-metax-contribution/src/vLLM-metax-mtp261
Fix Summary
The MTP path in vLLM 0.23 imports quantization and fusion modules before the MetaX plugin has installed compatibility shims. This can fail before generation on _C::scaled_fp4_quant.out, _C::silu_and_mul_per_block_quant, or missing torch.accelerator helpers in spawned EngineCore workers.
The PR adds an early vllm_metax.compat hook, imports it at package import time, keeps the legacy triton_support import path active, and adds a focused regression test for the act_quant_fusion import path.
The 35BA3B GPTQ validation exposed one additional vLLM 0.23 compatibility issue in the MetaX moe_wna16 quantization path: MoeWNA16Method.apply() assumed FusedMoEConfig.disable_inplace always exists. The PR now treats the field as optional and preserves the existing default in-place behavior when it is absent.
Before Fix
Pre-fix runs failed before generation on the same MetaX validation machine:
079_qwen35_9b_mtp261_smoke.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt: AttributeError: module 'torch.accelerator' has no attribute 'empty_cache'
088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt: RuntimeError: operator _C::scaled_fp4_quant.out does not exist
Validation
Remote MetaX C500 environment, MACA 3.5.3.20, vLLM 0.23.0:
collect_env and mx-smi: saved
python -m pytest tests/patch/test_triton_custom_op_schemas.py -q --confcutdir=tests/patch: 2 passed
import vllm_metax.quant_config; import vllm.compilation.passes.fusion.act_quant_fusion: passed
python -m py_compile ...: passed
git diff --check: passed
draft_tokens=86, accepted_tokens=54
draft_tokens=82, accepted_tokens=54
draft_tokens=90, accepted_tokens=82, accepted_tokens_per_pos=[41, 41]
draft_tokens=62, accepted_tokens=0
Qwen3.5-9B OpenAI-compatible server API MTP smoke: /v1/models, /v1/completions, and /metrics passed, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]
Validation matrix using the same prompt and max_tokens=96:
Long-output MTP stress check:
max_tokens=256: generated 256 tokens normally at 23.2837 tok/s
drafts=99, draft_tokens=198, accepted=158, accepted_tokens_per_pos=[85, 73]
Repetition metrics: bigram repeat ratio 0.0853, trigram 0.0391, 4-gram 0.0157
Most repeated 4-gram was Genshin Impact Version 5.0 with count 3, attributable to the requested topic rather than looped output
35BA3B model search and validation:
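Official Qwen3.6-35B-A3B-FP8: downloaded, but MetaX/vLLM rejected it with fp8 quantization is currently not supported in maca.
Eco-Tech Qwen3.6-35B-A3B-w8a8: downloaded, but it is an Ascend/msmodelslim format with no vLLM quantization_config; MetaX treated it as unquantized MoE and OOMed.
FlagRelease Qwen3.6-35B-A3B-nomtp-metax-FlagOS: metadata includes MTP weights. After a dual-C500 machine became available, the checkpoint was downloaded and validated with tensor_parallel_size=2.
potter001/gptq-Qwen3.6-35B-A3B-4bit-group: downloaded from ModelScope, 7 safetensors shards, 19.27 GiB after Git LFS cache cleanup, quant_method=gptq, mtp_num_hidden_layers=1.
35BA3B FlagRelease dual-C500 TP=2 results: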
CUDA_VISIBLE_DEVICES=0,1, tensor_parallel_size=2, max_model_len=2048, max_num_seqs=1, model tokenizer.
71903776768 bytes, model_type=qwen3_5_moe, 1045 indexed weights, 19 MTP weights, no missing shards.
max_tokens=32: target model and MTP drafter loaded, LOAD_S 296.123, GEN_S 7.929, OUTPUT_TOKENS_PER_S 4.0358, drafts=13, draft_tokens=26, accepted=20, accepted_tokens_per_pos=[13, 7].
max_tokens=32: LOAD_S 194.923, GEN_S 7.104, OUTPUT_TOKENS_PER_S 4.5042.
max_tokens=128: LOAD_S 150.358, GEN_S 6.293, OUTPUT_TOKENS_PER_S 20.3389, drafts=49, draft_tokens=98, accepted=79, accepted_tokens_per_pos=[46, 33].
max_tokens=128: LOAD_S 150.576, GEN_S 12.054, OUTPUT_TOKENS_PER_S 10.619.
The 128-token TP=2 comparison shows accepted-token MTP acceleration on the official/FlagRelease 35BA3B checkpoint: 20.3389 / 10.619 = 1.915x output-token throughput for this prompt and run configuration.
Peak observed model memory was about 36.0 GiB per C500 with MTP enabled and about 36.0 GiB per C500 without MTP, so dual 64GB cards were sufficient.
35BA3B GPTQ MTP results after moe_wna16 compatibility fix:
max_tokens=32: generated normally, LOAD_S 130.03, GEN_S 6.99, drafts=31, draft_tokens=62, accepted=0.
max_tokens=128: generated normally, OUTPUT_TOKENS_PER_S 9.2379, repeat ratios bigram 0.0806, trigram 0.0492, 4-gram 0.0167.
drafts=127, draft_tokens=254, accepted=0, accepted_tokens_per_pos=[0, 0].
num_speculative_tokens=1, max_tokens=64: generated normally, OUTPUT_TOKENS_PER_S 7.3182, drafts=63, draft_tokens=63, accepted=0, accepted_tokens_per_pos=[0].
OpenAI-compatible server API validation:
/mnt/moark-models/Qwen3.5-9B
vllm serve ... --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
/v1/models: returned served model qwen35-9b-mtp
/v1/completions: returned 48 completion tokens
/metrics: exposed spec-decode counters with drafts=18, draft_tokens=36, accepted_tokens=31, accepted_tokens_per_pos=[17, 14]
Notes:
The Qwen3.5-27B-W8A8 tokenizer config uses TokenizersBackend, which this Transformers environment cannot instantiate. The 27B validation therefore uses the local Qwen3.5-9B tokenizer as a smoke-test workaround.
The downloaded Qwen3.6-35B-A3B GPTQ tokenizer has the same TokenizersBackend compatibility problem, so the 35BA3B GPTQ runs also use the local Qwen3.5-9B tokenizer workaround.
Remote logs:
/data/vllm-metax-contribution/ops/logs/079_qwen35_9b_mtp261_smoke.txt
/data/vllm-metax-contribution/ops/logs/084_qwen35_9b_mtp261_mtp_only_after_schema_fix.txt
/data/vllm-metax-contribution/ops/logs/088_qwen35_27b_w8a8_mtp261_mtp_only_9b_tokenizer.txt
/data/vllm-metax-contribution/ops/logs/086_qwen35_9b_mtp261_mtp_only_after_accel_fix.txt
/data/vllm-metax-contribution/ops/logs/089_test_top_level_compat_import_order.txt
/data/vllm-metax-contribution/ops/logs/090_qwen35_27b_w8a8_mtp261_after_top_level_compat.txt
/data/vllm-metax-contribution/ops/logs/091_qwen35_27b_w8a8_mtp261_issue_like_prompt.txt
/data/vllm-metax-contribution/ops/logs/092_collect_env_mtp261.txt
/data/vllm-metax-contribution/ops/logs/093_mtp261_validation_matrix.txt
/data/vllm-metax-contribution/ops/logs/094_qwen35_27b_w8a8_mtp_long_output_check.txt
/data/vllm-metax-contribution/ops/logs/095_qwen36_35ba3b_fp8_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/096_qwen36_35ba3b_w8a8_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/097_qwen36_35ba3b_w8a8_experts_int8_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/098_qwen36_35ba3b_w8a8_quark_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/099_qwen36_35ba3b_w8a8_int8_per_channel_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/100_qwen36_35ba3b_gptq_mtp_smoke.txt
/data/vllm-metax-contribution/ops/logs/101_qwen36_35ba3b_gptq_mtp_9b_tokenizer_smoke.txt
/data/vllm-metax-contribution/ops/logs/102_qwen36_35ba3b_gptq_mtp_9b_tokenizer_file_smoke.txt
/data/vllm-metax-contribution/ops/logs/103_qwen36_35ba3b_gptq_mtp_after_moe_wna16_fix.txt
/data/vllm-metax-contribution/ops/logs/104_qwen36_35ba3b_gptq_mtp_issue_like_check.txt
/data/vllm-metax-contribution/ops/logs/105_qwen35_9b_mtp_openai_server_api.txt
/data/vllm-metax-contribution/ops/logs/106_qwen36_35ba3b_gptq_mtp_one_token_check.txt
/data/vllm-metax-contribution/ops/logs/200_dual_c500_download_flagrelease_35ba3b.log
/data/vllm-metax-contribution/ops/logs/201_cpu_resume_flagrelease_35ba3b_lfs_pull.log
/data/vllm-metax-contribution/ops/logs/202_dual_c500_flagrelease_35ba3b_mtp_smoke.log
/data/vllm-metax-contribution/ops/logs/203_dual_c500_flagrelease_35ba3b_no_mtp_smoke.log
/data/vllm-metax-contribution/ops/logs/204_dual_c500_flagrelease_35ba3b_mtp_128.log
/data/vllm-metax-contribution/ops/logs/205_dual_c500_flagrelease_35ba3b_no_mtp_128.log
Next