### 🎉 MiniMax-M2-NVFP4
No matter what parameter combos or hacks I throw at it, the tokenizer still spits out pure nonsense.
Here’s the most maxed-out attempt:
```bash
# One-time patch: gate FlashInfer autotune behind an env var
docker exec vllm_node bash -c "
# 1) Prepend 'import os' to kernel_warmup.py
sed -i '1s/^/import os\n/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
# 2) Skip the autotune branch when VLLM_SKIP_FLASHINFER_AUTOTUNE=1
#    (the indentation after \n must match the original if-block)
sed -i 's/if has_flashinfer() and current_platform.has_device_capability(90):/skip_autotune = os.environ.get(\"VLLM_SKIP_FLASHINFER_AUTOTUNE\", \"0\") == \"1\"\n    if has_flashinfer() and current_platform.has_device_capability(90) and not skip_autotune:/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
"
```
```bash
# Launch WITHOUT --enforce-eager (CUDA graphs enabled!):
docker exec vllm_node bash -c "
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export HF_HUB_OFFLINE=1
export VLLM_SKIP_FLASHINFER_AUTOTUNE=1
vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name minimax \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  -tp 2 --distributed-executor-backend ray \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser minimax_m2 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --dtype auto --kv-cache-dtype fp8 \
  > /tmp/vllm.log 2>&1" &
```
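Once the server is up, a plain chat request against the OpenAI-compatible endpoint reproduces the exchange below (a minimal sketch; `minimax` and port 8000 match the `--served-model-name` and `--port` flags above):

```bash
# Send a one-line prompt to the running server
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "minimax", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'
```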
Me:
Hello!
Model (verbatim output):
```
Thoughts
###
**
**
.
.
**!
**
** !**
```
Speed: ~25+ tok/s with CUDA graphs enabled - but the output is nonsense :)
The same garbage comes out with `--enforce-eager` and with minimal (default) parameters.
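For completeness, "minimal params" means a bare eager-mode launch along these lines (a sketch; only the model and flags already mentioned above are assumed):

```bash
# Bare-bones eager launch - same nonsense output
vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  -tp 2 \
  --enforce-eager
```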