### 🎉 MiniMax-M2-NVFP4
No matter what parameter combos or hacks I throw at it, the tokenizer still spits out pure nonsense.
Here’s the most maxed-out attempt:
```bash
# One-time patch: gate FlashInfer autotune behind an env var
docker exec vllm_node bash -c "
# 1) Prepend 'import os' to kernel_warmup.py
sed -i '1s/^/import os\n/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
# 2) Skip the autotune branch when VLLM_SKIP_FLASHINFER_AUTOTUNE=1
#    (the indentation after \n must match the original if-block)
sed -i 's/if has_flashinfer() and current_platform.has_device_capability(90):/skip_autotune = os.environ.get(\"VLLM_SKIP_FLASHINFER_AUTOTUNE\", \"0\") == \"1\"\n    if has_flashinfer() and current_platform.has_device_capability(90) and not skip_autotune:/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
"
```
```bash
# Launch WITHOUT --enforce-eager (CUDA graphs enabled!):
docker exec vllm_node bash -c "
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export HF_HUB_OFFLINE=1
export VLLM_SKIP_FLASHINFER_AUTOTUNE=1
vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name minimax \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  -tp 2 --distributed-executor-backend ray \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser minimax_m2 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --dtype auto --kv-cache-dtype fp8 \
  > /tmp/vllm.log 2>&1" &
```
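Once the server is up, a plain chat request against the OpenAI-compatible endpoint reproduces the exchange below (a minimal sketch; `minimax` and port 8000 match the `--served-model-name` and `--port` flags above):

```bash
# Send a one-line prompt to the running server
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "minimax", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'
```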
Me:
Hello!
Model (verbatim output):
```
Thoughts
###
**
**
.
.
**!
**
** !**
```
Speed: ~25+ tok/s with CUDA graphs enabled - but the output is nonsense :)
The same garbage comes out with `--enforce-eager` and with minimal (default) parameters.
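For completeness, "minimal params" means a bare eager-mode launch along these lines (a sketch; only the model and flags already mentioned above are assumed):

```bash
# Bare-bones eager launch - same nonsense output
vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  -tp 2 \
  --enforce-eager
```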