Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node)

### 🎉 MiniMax-M2-NVFP4

No matter what parameter combos or hacks I throw at it, the model still spits out pure nonsense.

Here's the most complete setup I've tried:

```bash
# One-time patch to disable FlashInfer autotune:
docker exec vllm_node bash -c "
sed -i '1s/^/import os\n/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
sed -i 's/if has_flashinfer() and current_platform.has_device_capability(90):/skip_autotune = os.environ.get(\"VLLM_SKIP_FLASHINFER_AUTOTUNE\", \"0\") == \"1\"\n    if has_flashinfer() and current_platform.has_device_capability(90) and not skip_autotune:/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
"
```

```bash
# Launch WITHOUT --enforce-eager (CUDA graphs enabled!):
docker exec vllm_node bash -c "
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export HF_HUB_OFFLINE=1
export VLLM_SKIP_FLASHINFER_AUTOTUNE=1

vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name minimax \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  -tp 2 --distributed-executor-backend ray \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser minimax_m2 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --dtype auto --kv-cache-dtype fp8 \
  > /tmp/vllm.log 2>&1" &
```
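
For reference, this is the kind of request that produces the garbage; nothing exotic, just the standard OpenAI-compatible chat endpoint (port 8000 and the served name "minimax" from the launch above), run from wherever the port is reachable:

```bash
# minimal smoke test against the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "minimax",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 64,
       "temperature": 0}'
```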

Me:
Hello!

Model (output verbatim):

```
Thoughts

###
**
**
.

.

**!

 **

**   !**
```

Speed: ~25+ tok/s with CUDA graphs enabled, but the output is still nonsense :)
The same garbage shows up with `--enforce-eager` and with minimal (default) params (bare launch sketched below).
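
By "minimal (default) params" I mean essentially just the model plus the multi-node bits, something like this (a sketch, not the exact command; everything not shown is left at vLLM defaults):

```bash
# bare-bones launch (sketch) used to rule out the extra flags;
# tried both with and without --enforce-eager
docker exec vllm_node bash -c "
vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  -tp 2 --distributed-executor-backend ray
"
```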