OK, no problem. I’ll try with my own.
But vllm-tune can also tune TP=1, right?
@jcrone I had the very same problem. For some reason, the FP8 variant published directly by Qwen uses num_experts instead of num_local_experts, so the MoE check fails. Updating the script (vllm-tune.sh) to search for num_experts makes it run just fine. I think @serapis can update it to search for both variants.
But I have one more question: I did that with tp=1 for my single unit and I see barely any speed increase (yes, it is Qwen 3.6 35B A3B). I’ve tested it numerous times with different vLLM parameters but see almost no difference at all. vLLM shows that the MoE and FP8 configs are correctly loaded.
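To illustrate the "search for both variants" idea, here is a minimal Python sketch. It is not the actual logic in vllm-tune.sh; it only assumes the model ships a Hugging Face style config.json with the expert count under one of the two top-level keys mentioned above. The helper names expert_count and is_moe are made up for this example.

```python
import json
from pathlib import Path
from typing import Optional

# The two key names discussed in this thread. Other model families may
# use different names or nest them deeper; that is not handled here.
EXPERT_KEYS = ("num_local_experts", "num_experts")


def expert_count(config: dict) -> Optional[int]:
    """Return the expert count if either known key holds a positive int."""
    for key in EXPERT_KEYS:
        value = config.get(key)
        if isinstance(value, int) and value > 0:
            return value
    return None


def is_moe(config_path: str) -> bool:
    """Hypothetical helper: load a config.json and report whether it looks like MoE."""
    config = json.loads(Path(config_path).read_text())
    return expert_count(config) is not None
```

Checking a tuple of keys rather than a single hard-coded name means a new spelling only needs one more entry in EXPERT_KEYS.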
Yes! You can deploy the mod via --sync-mod; see the SeraphimSerapis/vllm-tune repository on GitHub (vLLM Tune consolidates MoE and FP8 dense GEMM kernel tuning into a single command). I personally run my own flavour of recipes, so please let me know if that doesn’t end up working.
I also tweaked the script, and it now recognizes Qwen/Gemma-style MoE configs more reliably; details are in the release notes.
Update via curl -fsSL https://raw.githubusercontent.com/SeraphimSerapis/vllm-tune/main/install.sh | bash.
You may not see the same dramatic effect, depending on the setup, number of nodes, etc. I would recommend running 2-3 benchmarks at different context depths both before and after applying the mod, then comparing the averages.
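For the before/after comparison, something like the following Python sketch would do. It assumes you have already collected token generation throughput (tokens/s) from your own benchmark runs; the function name percent_change and the numbers in the usage lines are placeholders, not measurements.

```python
from statistics import mean
from typing import Sequence


def percent_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Percent change of the average throughput after applying the mod."""
    if not before or not after:
        raise ValueError("need at least one run before and one run after")
    baseline = mean(before)
    return (mean(after) - baseline) / baseline * 100.0


# Placeholder values only: substitute the tokens/s from your own runs.
runs_before = [50.0, 52.0, 48.0]
runs_after = [53.0, 55.0, 51.0]
print(f"average change: {percent_change(runs_before, runs_after):+.1f}%")
```

Averaging over several runs at different context depths keeps a single noisy run from hiding or exaggerating the effect.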
Dense models profit the least from this; MoE models can see up to a 10% improvement in token generation.