vLLM correctly loads weights (41/41 shards), then during profile_run:
INFO [flashinfer_utils.py:289] Flashinfer TRTLLM MOE backend is only supported on SM100 and later, using CUTLASS backend instead
INFO [modelopt.py:1142] Using FlashInfer CUTLASS kernels for ModelOptNvFp4FusedMoE.
...
RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal
FlashInfer detects that GB10 is not SM100 (B200) and falls back to CUTLASS, but the CUTLASS FP4 path fails as well.
### Key Question
Are CUTLASS FP4 GEMM kernels compiled for GB10 (sm_121a)?
I see NVFP4 models tested on:
- B200 (sm_100) ✅
- H100/A100 with the Marlin FP4 fallback ✅

But GB10 is sm_121 (the Blackwell desktop/workstation variant). The error mentions sm120, which seems wrong - GB10 should be sm_121a.
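For what it's worth, the capability-to-arch mapping can be sketched in a few lines (a toy illustration; on real hardware the tuple would come from `torch.cuda.get_device_capability()`, which isn't called here so this runs anywhere):

```python
# Toy sketch: map a CUDA compute capability tuple to the "sm_XY" arch
# string that kernel builds target. The (12, 1) value for GB10 is taken
# from the PyTorch warning quoted later in this thread.

def sm_arch(capability):
    major, minor = capability
    return f"sm_{major}{minor}"

print(sm_arch((12, 1)))  # GB10 -> sm_121 (the error message names sm120)
print(sm_arch((10, 0)))  # B200 -> sm_100
```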
It runs NVFP4 models just fine, but doesn't have all the optimizations yet (or rather, vLLM doesn't have all the optimizations for sm121 yet). So they work, but AWQ quants are faster: PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM
This model will load. HOWEVER, even dual Sparks don't have enough memory left for the KV cache. You can try --enforce-eager to disable CUDA graphs; then you may be able to squeeze out some RAM for the KV cache, but probably not enough to be useful.
You may also want to try QuantTrio/GLM-4.6-AWQ - it has a slightly smaller file size, and with --enable-expert-parallel you may be able to squeeze some context alongside this model.
Well, actually, don't use --enable-expert-parallel - it causes uneven memory utilization across the two Sparks.
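Putting the suggestions above together, a launch sketch (eager mode, no expert parallel); the flags mirror the MiniMax-M2 command shown later in this thread, and the memory fraction is an assumed starting point you'd tune, not a tested recommendation:

```shell
# Sketch only: dual-Spark launch with CUDA graphs disabled to free RAM
# for the KV cache. --gpu-memory-utilization 0.7 is an assumption.
vllm serve QuantTrio/GLM-4.6-AWQ \
  --host 0.0.0.0 --port 8000 \
  -tp 2 --distributed-executor-backend ray \
  --gpu-memory-utilization 0.7 \
  --enforce-eager
```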
I managed to run this AWQ quant on my dual sparks - getting 16 tokens/s and pretty decent prompt processing speeds (vllm showed ~720 t/s in logs, but it usually shows lower numbers there due to how prompts are split, I guess).
You can squeeze up to 50K context with this model: GPU KV cache size: 51,440 tokens
I ran with 8K just in case and got nice concurrency.
As for expert parallel, it won't bring any performance advantage on a dual-Spark cluster. You may get slightly higher total throughput, but single-request inference will be slower. It was designed to work together with data parallelism on clusters with multiple GPUs on every node. On my system it also results in uneven memory utilization for some reason.
The MiniMax-M2-AWQ model is absolutely amazing - so much attention to detail and prompt/context following. I've already used it for a couple of projects and it works great! However, I'm struggling to match the 40 t/s single-prompt throughput you mentioned.
I rebuilt your Docker image from the latest eugr/spark-vllm-docker (commit e0f6cff, Dec 18) and I'm getting ~26 tok/s for a single prompt.
### 1 Prompt Benchmark:
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 4.56
Total input tokens: 12
Total generated tokens: 119
Request throughput (req/s): 0.22
Output token throughput (tok/s): 26.10
Peak output token throughput (tok/s): 27.00
---------------Time to First Token----------------
Mean TTFT (ms): 169.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 37.21
---------------Inter-token Latency----------------
Mean ITL (ms): 37.21
==================================================
### 10 Prompts Benchmark:
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 56.06
Total input tokens: 1354
Total generated tokens: 2665
Request throughput (req/s): 0.18
Output token throughput (tok/s): 47.54
Peak output token throughput (tok/s): 80.00
---------------Time to First Token----------------
Mean TTFT (ms): 488.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 116.26
---------------Inter-token Latency----------------
Mean ITL (ms): 88.74
==================================================
Is your Ray cluster active and using InfiniBand? Something is definitely not right here. Can you post the vLLM launch logs?
You can also try my new launch-cluster.sh script, then launch MiniMax M2 like this (run it on your desired head node). It will start the cluster on all nodes, auto-configure the interfaces, and launch vLLM with the model:
Auto-detecting interfaces...
Detected IB_IF: rocep1s0f1,roceP2p1s0f1. <--- MAKE SURE THESE ONES ARE THERE
Detected ETH_IF: enp1s0f1np1
Detected Local IP: 192.168.177.11 (192.168.177.11/24)
Auto-detecting nodes...
Scanning for SSH peers on 192.168.177.11/24...
Found peer: 192.168.177.12
Cluster Nodes: 192.168.177.11,192.168.177.12
Head Node: 192.168.177.11
Worker Nodes: 192.168.177.12
Container Name: vllm_node
Action: exec
Checking SSH connectivity to worker nodes...
SSH to 192.168.177.12: OK
Starting Head Node on 192.168.177.11...
779d2ffdee60ff0a6f821f9f9799f638182aa6d8839a5fc4b4ce79f1f71d6c18
Starting Worker Node on 192.168.177.12...
371dc5c5c9e159efd8a8741817f4ec50a9a02f4ba0a546056493fa2eb0b59444
Waiting for cluster to be ready...
Cluster head is responsive.
Executing command on head node: vllm serve QuantTrio/MiniMax-M2-AWQ --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 128000 --load-format fastsafetensors --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think
(APIServer pid=1025) INFO 12-20 04:52:12 [api_server.py:1262] vLLM API server version 0.14.0rc1.dev16+g969bbc7c6.d20251219
[ .. skip .. ]
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NCCL version 2.27.7+cuda13.0
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enp1s0f1np1:192.168.177.11<0>
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO Initialized NET plugin IB
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO Assigned NET plugin IB to comm [repeated 3x across cluster]
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO Using network IB [repeated 3x across cluster]
[ .. skip .. ]
The most important lines in the log would contain this info (I cut off the prefix so it fits on the screen):
NCCL INFO NCCL version 2.27.7+cuda13.0
NCCL INFO NCCL_IB_DISABLE set by environment to 0.
NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enp1s0f1np1:192.168.177.11<0>
NCCL INFO Initialized NET plugin IB
NCCL INFO Assigned NET plugin IB to comm [repeated 3x across cluster]
NCCL INFO Using network IB [repeated 3x across cluster]
If you don't see both RoCE interfaces in the "Using …" line and you don't see "Using network IB", but instead see a lot of references to "Socket" or "Using network Eth" or something like that, you'll need to troubleshoot your InfiniBand settings.
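If you want to automate that check over captured logs, a tiny sketch (the sample lines below are illustrative, copied from the log format shown above):

```python
# Sketch: scan captured NCCL log lines for the health indicator described
# above. Returns True only if NCCL reports using the IB network path.

def nccl_uses_ib(log_lines):
    return any("Using network IB" in line for line in log_lines)

healthy = [
    "NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]",
    "NCCL INFO Using network IB",
]
fallback = ["NCCL INFO Using network Socket"]

print(nccl_uses_ib(healthy))   # True
print(nccl_uses_ib(fallback))  # False
```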
I updated to the latest launch-cluster.sh and NCCL looks fine and uses InfiniBand:
NCCL INFO NCCL_IB_DISABLE set by environment to 0.
NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enP2p1s0f1np1:192.168.200.17<0>
NCCL INFO Initialized NET plugin IB
NCCL INFO Using network IB
All 16 channels show via NET/IB/0 and via NET/IB/1 ✅
However, the performance is still the same: ~26 tok/s for single prompt.
Also seeing these warnings:
WARNING: SymmMemCommunicator: Device capability 12.1 not supported
WARNING: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
PyTorch supports (8.0) - (12.0)
=== Launch ===
Detected Local IP: 192.168.200.16 (192.168.200.16/24)
Head Node: 192.168.200.16
Worker Nodes: 192.168.200.17
Container Name: vllm_node
Action: exec
Checking SSH connectivity to worker nodes…
SSH to 192.168.200.17: OK
Starting Head Node on 192.168.200.16…
3b330021c56562aaf28f0802820de4f48e6067c9b8f4077fe8ca2b9c8d915ee1
Starting Worker Node on 192.168.200.17…
b7026a0736aedde527f510f5776ecc8a03c4dc11184c8b437c57c6f50a7fb1b7
Waiting for cluster to be ready…
Cluster head is responsive.
Executing command on head node: vllm serve QuantTrio/MiniMax-M2-AWQ --port 8000 …
(APIServer pid=1024) INFO 12-20 05:17:10 vLLM API server version 0.13.0rc2.dev288
[ … model loading … ]
(RayWorkerWrapper pid=1235) NCCL INFO NCCL_SOCKET_IFNAME set by environment to enP2p1s0f1np1
(RayWorkerWrapper pid=1235) NCCL INFO Bootstrap: Using enP2p1s0f1np1:192.168.200.16<0>
(RayWorkerWrapper pid=1235) NCCL INFO NCCL version 2.27.7+cuda13.0
(RayWorkerWrapper pid=1235) NCCL INFO NCCL_IB_DISABLE set by environment to 0.
(RayWorkerWrapper pid=1235) NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
(RayWorkerWrapper pid=1235) NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enP2p1s0f1np1:192.168.200.16<0>
(RayWorkerWrapper pid=1235) NCCL INFO Initialized NET plugin IB
(RayWorkerWrapper pid=1235) NCCL INFO Assigned NET plugin IB to comm
(RayWorkerWrapper pid=1235) NCCL INFO Using network IB
(RayWorkerWrapper pid=1235) Channel 00/0 : 0[0] → 1[0] [send] via NET/IB/0
(RayWorkerWrapper pid=1235) Channel 01/0 : 0[0] → 1[0] [send] via NET/IB/1
[ … all 16 channels via NET/IB/0 and NET/IB/1 … ]
(RayWorkerWrapper) WARNING: SymmMemCommunicator: Device capability 12.1 not supported
(RayWorkerWrapper) WARNING: Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper) WARNING: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
I noticed you’re running vLLM 0.14.0rc1 while I have 0.13.0rc2. Could that be the difference?
No. You can try to rebuild the container to see if there is any difference, but I was getting 40 t/s from the very beginning.
Can you post your full vllm bench serve command and the new results?
One thing I noticed is that you've assigned the IP to enP2p1s0f1np1. NVIDIA recommends using its "primary" twin, enp1s0f1np1. I wonder whether that makes a difference. It shouldn't, but just in case.
Another possibility is that there are some other resource-consuming processes running on one of the nodes.
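For reference, here's roughly the kind of vllm bench serve invocation I'd expect (a sketch; the dataset and length flags are assumptions based on the token counts in your results - adjust them to match your actual run):

```shell
# Sketch: single-prompt serving benchmark against a running vLLM server.
# Input/output lengths here are assumptions, not your exact settings.
vllm bench serve \
  --host 127.0.0.1 --port 8000 \
  --model QuantTrio/MiniMax-M2-AWQ \
  --dataset-name random \
  --num-prompts 1 \
  --random-input-len 12 \
  --random-output-len 119
```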
============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 3.07
Total input tokens: 12
Total generated tokens: 119
Request throughput (req/s): 0.33
Output token throughput (tok/s): 38.77
Peak output token throughput (tok/s): 40.00
Peak concurrent requests: 1.00
Total token throughput (tok/s): 42.68
---------------Time to First Token----------------
Mean TTFT (ms): 110.34
Median TTFT (ms): 110.34
P99 TTFT (ms): 110.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.07
Median TPOT (ms): 25.07
P99 TPOT (ms): 25.07
---------------Inter-token Latency----------------
Mean ITL (ms): 25.07
Median ITL (ms): 24.85
P99 ITL (ms): 27.24
==================================================
I guess you are getting pretty much the same performance now. I'd make sure there is no unnecessary traffic between the two nodes over the ConnectX interface other than vLLM's. Use the "normal" Ethernet interface for everything else (except copying images from one Spark to another or similar large transfers).
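One quick way to spot stray traffic is to sample the interface counters while vLLM is idle (the interface name here matches the logs earlier in this thread; it's an assumption that yours is the same):

```shell
# Sketch: read RX/TX byte counters on the ConnectX netdev twice and
# compare; a large delta while vLLM is idle means other traffic is
# sharing the link.
ip -s link show enp1s0f1np1
sleep 10
ip -s link show enp1s0f1np1
```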
No, I have Founders Edition ones. Haven’t had any issues with cooling so far, but I made sure their airflow is not obstructed - plenty of clearance around.