Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node)

Hardware

  • 2x DGX Spark (GB10 GPU each, sm_121a / compute capability 12.1)

  • Connected via 200GbE ConnectX-7/Ethernet

  • Driver: 580.95.05, Host CUDA: 13.0

Goal

Run lukealonso/GLM-4.6-NVFP4 (357B MoE model, NVFP4 quantization) across both nodes using vLLM with Ray distributed backend.

What I’ve Tried

1. nvcr.io/nvidia/vllm:25.11-py3 (NGC)

  • vLLM 0.11.0

  • Error: FlashInfer kernels unavailable for ModelOptNvFp4FusedMoE on current platform

  • NVFP4 requires vLLM 0.12.0+

2. vllm/vllm-openai:nightly-aarch64 (vLLM 0.11.2.dev575)

  • With VLLM_USE_FLASHINFER_MOE_FP4=1

  • Error: ptxas fatal: Value 'sm_121a' is not defined for option 'gpu-name'

  • Triton’s bundled ptxas 12.8 doesn’t support GB10

3. vllm/vllm-openai:v0.12.0-aarch64 (vLLM 0.12.0)

  • Fixed ptxas with symlink: ln -sf /usr/local/cuda/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas (see the ptxas check sketch after this list)

  • Triton compilation passes ✅

  • Error: RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal

4. Tried both parallelism modes:

  • --tensor-parallel-size 2 → same CUTLASS error

  • --pipeline-parallel-size 2 → same CUTLASS error

5. --enforce-eager flag

  • Not fully tested yet
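
For anyone hitting the same ptxas issue, a quick way to check both copies is below. Paths are the ones from the images above and may differ in other builds; newer Triton builds also honor a TRITON_PTXAS_PATH override, which avoids the symlink (worth confirming against your Triton version):

# ptxas that Triton bundles (12.8 here - no sm_121a)
/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas --version

# ptxas from the container's CUDA toolkit (12.9 here - knows sm_121a)
/usr/local/cuda/bin/ptxas --version

# Alternative to the symlink, if your Triton build honors it:
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas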

Environment Details

| Component | Version |
|-----------|---------|
| Host Driver | 580.95.05 |
| Host CUDA | 13.0 |
| Container CUDA | 12.9 |
| Container ptxas | 12.9.86 (supports sm_121a ✅) |
| Triton bundled ptxas | 12.8 (no sm_121a ❌) |
| PyTorch | 2.9.0+cu129 |

The Blocking Error

vLLM correctly loads weights (41/41 shards), then during profile_run:


INFO [flashinfer_utils.py:289] Flashinfer TRTLLM MOE backend is only supported on SM100 and later, using CUTLASS backend instead

INFO [modelopt.py:1142] Using FlashInfer CUTLASS kernels for ModelOptNvFp4FusedMoE.

...

RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal

FlashInfer detects GB10 is not SM100 (B200), falls back to CUTLASS - but CUTLASS FP4 also fails.
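
For reference, this fallback is keyed on the reported compute capability. A quick check of what PyTorch sees on GB10 (run inside the container):

python3 - <<'EOF'
import torch
# GB10 should report (12, 1); the FlashInfer TRTLLM MoE path wants SM100,
# so anything else falls through to the CUTLASS backend.
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
EOF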

Key Question

Are CUTLASS FP4 GEMM kernels compiled for GB10 (sm_121a)?

I see NVFP4 models tested on:

  • B200 (sm_100) ✅

  • H100/A100 with Marlin FP4 fallback ✅

But GB10 is sm_121 (Blackwell desktop/workstation variant). The error says sm120 which seems wrong - GB10 should be sm_121a.

Is there:

  1. A vLLM build with CUTLASS kernels for sm_121?

  2. A way to force Marlin FP4 fallback on GB10?

  3. Recommended Docker image for DGX Spark + NVFP4?

Thanks!

You can try my Docker build: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

It runs NVFP4 models just fine, but doesn’t have all the optimizations yet (or rather, VLLM doesn’t have all optimizations for sm121 yet). So they work, but AWQ quants are faster: PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM

This model will load, HOWEVER, even dual Sparks don’t have enough memory left for a usable KV cache. You can try --enforce-eager to disable CUDA graphs; that frees some RAM for KV cache, but probably not enough to be useful.
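
To put rough numbers on that, here is a back-of-the-envelope KV-cache estimate. The layer/head/dim values below are placeholders, not the real GLM-4.6 config - take them from the model’s config.json:

python3 - <<'EOF'
# Rough KV-cache sizing sketch; the config values are illustrative placeholders.
layers, kv_heads, head_dim = 92, 8, 128   # <-- replace with values from config.json
bytes_per_elem = 2                        # fp16/bf16 KV cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(f"~{per_token / 1024:.0f} KiB per token")
print(f"~{per_token * 16384 / 2**30:.1f} GiB for a 16K-token KV cache")
EOF

Whatever is left on the two 128 GB nodes after weights, activations, and CUDA graph buffers is what bounds the usable context.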

You may also want to try QuantTrio/GLM-4.6-AWQ - it has a slightly smaller file size, and with --enable-expert-parallel you may be able to squeeze some context alongside this model.

Well, actually don’t use --enable-expert-parallel - it causes uneven memory utilization on two sparks.

I managed to run this AWQ quant on my dual sparks - getting 16 tokens/s and pretty decent prompt processing speeds (vllm showed ~720 t/s in logs, but it usually shows lower numbers there due to how prompts are split, I guess).

You can squeeze up to 50K context with this model: GPU KV cache size: 51,440 tokens

I ran with 8K just in case and got nice concurrency.

Some benches:

vllm serve QuantTrio/GLM-4.6-AWQ --gpu-memory-utilization 0.85 --max-model-len 8192 -tp 2 --distributed-executor-backend ray --enable-auto-tool-choice --tool-call-parser glm45 --reasoning-parser glm45 --host 0.0.0.0 --port 8888
vllm bench serve   --backend vllm   --model QuantTrio/GLM-4.6-AWQ   --endpoint /v1/completions   --dataset-name sharegpt   --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json   --num-prompts 1   --port 8888 --host spark
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  7.58
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.13
Output token throughput (tok/s):         15.70
Peak output token throughput (tok/s):    17.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          17.28
---------------Time to First Token----------------
Mean TTFT (ms):                          211.99
Median TTFT (ms):                        211.99
P99 TTFT (ms):                           211.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          62.43
Median TPOT (ms):                        62.43
P99 TPOT (ms):                           62.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           62.43
Median ITL (ms):                         60.99
P99 ITL (ms):                            75.67
==================================================
                            Output tokens per second
  50 +----------------------------------------------------------------------+
     | *   *                                                                |
  45 | *   *                                                                |
     | **** *****                                                           |
  40 | *         *                                                          |
     | *         **** *** * *                                               |
  35 | *             **  ****                                               |
     | *              *  *** *                                              |
  30 | *                     *                                              |
  25 |*                      ** ** **** ****** *   *    * * * * *  *        |
     |*                        *  *    *    * * *** * ** * * * * ****       |
  20 |*                                             **               *      |
     |*                                                              *      |
  15 |*                                                               *     |
     |*                                                               *     |
  10 |*                                                               *     |
     |*                                                               *     |
   5 |*                                                               *     |
     |                                                                *     |
   0 +----------------------------------------------------------------------+
     0       10      20      30      40     50      60      70      80      90

                         Concurrent requests per second
  10 +----------------------------------------------------------------------+
     |*                                                                     |
     | *                                                                    |
     | *                                                                    |
   8 | *                                                                    |
     |  *                                                                   |
     |  *********                                                           |
     |           *                                                          |
   6 |           *****                                                      |
     |                *                                                     |
     |                *                                                     |
   4 |                ********                                              |
     |                       *                                              |
     |                       ******************                             |
     |                                         *                            |
   2 |                                         ***********************      |
     |                                                                *     |
     |                                                                *     |
     |                                                                *     |
   0 +----------------------------------------------------------------------+
     0       10      20      30      40     50      60      70      80      90
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  82.92
Total input tokens:                      1371
Total generated tokens:                  2414
Request throughput (req/s):              0.12
Output token throughput (tok/s):         29.11
Peak output token throughput (tok/s):    48.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          45.65
---------------Time to First Token----------------
Mean TTFT (ms):                          1736.73
Median TTFT (ms):                        1895.35
P99 TTFT (ms):                           1896.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          144.26
Median TPOT (ms):                        155.90
P99 TPOT (ms):                           165.97
---------------Inter-token Latency----------------
Mean ITL (ms):                           121.81
Median ITL (ms):                         113.41
P99 ITL (ms):                            186.84
==================================================
                             Output tokens per second
  250 +---------------------------------------------------------------------+
      |                                                                     |
      |    *                                                                |
      |    **                                                               |
  200 |    ***                                                              |
      |   *********                                                         |
      |   **************                                                    |
      |   ***************                                                   |
  150 |   *******************                                               |
      |   * *********************                                           |
      |   *    *********************                                        |
  100 |   *            ***************                                      |
      |   *               *****************    * *                          |
      |  **                      ****************** **                      |
      |  *                         ** ***** *************                   |
   50 | **                                * ****     ******* **             |
      |***                                            *  * ********         |
      |***                                                      *******     |
      |***                                                            *     |
    0 +---------------------------------------------------------------------+
      0             50           100           150           200           250

                          Concurrent requests per second
  100 +---------------------------------------------------------------------+
      |*                                                                    |
      | **                                                                  |
      |  **                                                                 |
   80 |   **                                                                |
      |    **                                                               |
      |     **                                                              |
      |      ****                                                           |
   60 |         ********                                                    |
      |                 ***                                                 |
      |                   ******                                            |
   40 |                        ***                                          |
      |                           ***                                       |
      |                             *******                                 |
      |                                   *******                           |
   20 |                                          ******                     |
      |                                               ***                   |
      |                                                 *******             |
      |                                                       ********      |
    0 +---------------------------------------------------------------------+
      0             50           100           150           200           250
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  229.51
Total input tokens:                      22992
Total generated tokens:                  19483
Request throughput (req/s):              0.44
Output token throughput (tok/s):         84.89
Peak output token throughput (tok/s):    229.00
Peak concurrent requests:                100.00
Total Token throughput (tok/s):          185.07
---------------Time to First Token----------------
Mean TTFT (ms):                          6768.84
Median TTFT (ms):                        6933.76
P99 TTFT (ms):                           14792.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          563.80
Median TPOT (ms):                        457.68
P99 TPOT (ms):                           1727.53
---------------Inter-token Latency----------------
Mean ITL (ms):                           401.45
Median ITL (ms):                         402.76
P99 ITL (ms):                            1709.23
==================================================

Hey @eugr, huge thanks! Your spark-vllm-docker saved my day 🙏

Got lukealonso/GLM-4.6-NVFP4 running on dual Sparks with --enforce-eager. Works!

Surprising: NVFP4 generation is ~20% slower than W4A16 - not the 1.7x NVIDIA advertised. Probably sm121 optimizations aren’t ready yet.


My Benchmarks (2x DGX Spark, TP=2, Ray)

docker exec vllm_node bash -c "HF_HUB_OFFLINE=1 vllm serve lukealonso/GLM-4.6-NVFP4 \
  --served-model-name glm \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384 \
  --enforce-eager \
  --trust-remote-code"
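
Once it comes up, a quick smoke test against the OpenAI-compatible API (port and served model name as above):

curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'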

lukealonso/GLM-4.6-NVFP4 (357B, NVFP4)

| Metric | My (2x Spark) | 0xSero (8x 3090) |
|--------|---------------|------------------|
| Prefill | **~1045 tok/s** ✅ | ~889 tok/s |
| Generation | ~9.6 tok/s | ~31 tok/s |
| TTFT (10K ctx) | 9.59 s | 23.82 s |
| Serving throughput | 19.88 tok/s | - |

Prefill faster than 8x 3090! Generation slower due to --enforce-eager.

GLM-4.6-REAP-218B-A32B-W4A16-AutoRound (MoE, W4A16)

| Metric | My (2x Spark) | 0xSero (8x 3090) |
|--------|---------------|------------------|
| Prefill | ~391 tok/s | ~889 tok/s |
| Generation | ~10.4 tok/s | ~31 tok/s |
| TTFT (20K ctx) | 51.17 s | 23.82 s |
| Serving throughput | 17.94 tok/s | - |

cerebras/MiniMax-M2-REAP-162B (MXFP4)

~18 tok/s via llama.cpp RPC (vLLM doesn’t support cerebras arch yet)
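
For reference, the llama.cpp RPC split was roughly along these lines - a sketch only; the GGUF path, host/port, and exact flag spellings should be checked against your llama.cpp build’s --help:

# on the second Spark: expose its backend over RPC
./rpc-server --host 0.0.0.0 --port 50052

# on the first Spark: serve the model, offloading part of it to the remote node
./llama-server -m MiniMax-M2-REAP-162B-MXFP4.gguf \
  --rpc 192.168.200.17:50052 -ngl 99 --host 0.0.0.0 --port 8080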


Haven’t tried QuantTrio/GLM-4.6-AWQ yet - your 16 tok/s looks promising! Will test soon.


Key Findings (vLLM + DGX Spark)

  • NVFP4 works only with your spark-vllm-docker (TORCH_CUDA_ARCH_LIST=12.1a)

  • Standard images fail: NGC (no FlashInfer), nightly (ptxas sm_121a), v0.12.0 (CUTLASS sm120)

  • --enforce-eager mandatory for NVFP4 - CUDA graphs crash

  • W4A16 currently faster than NVFP4 (no sm121 optimizations yet)

  • nccl/ray architecture not in vLLM yet

Thanks again! 🚀

You can run the original MiniMax M2 though. I’m getting 40 t/s on dual Sparks with AWQ quant:

vllm serve QuantTrio/MiniMax-M2-AWQ --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 128000 --load-format fastsafetensors --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think
vllm bench serve   --backend vllm   --model QuantTrio/MiniMax-M2-AWQ   --endpoint /v1/completions   --dataset-name sharegpt   --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json   --num-prompts 1   --port 8888 --host spark
 
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  3.04
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.33
Output token throughput (tok/s):         39.09
Peak output token throughput (tok/s):    41.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          43.03
---------------Time to First Token----------------
Mean TTFT (ms):                          111.71
Median TTFT (ms):                        111.71
P99 TTFT (ms):                           111.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.85
Median TPOT (ms):                        24.85
P99 TPOT (ms):                           24.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.85
Median ITL (ms):                         24.63
P99 ITL (ms):                            34.96
==================================================
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  37.88
Total input tokens:                      1354
Total generated tokens:                  2665
Request throughput (req/s):              0.26
Output token throughput (tok/s):         70.35
Peak output token throughput (tok/s):    120.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          106.09
---------------Time to First Token----------------
Mean TTFT (ms):                          788.35
Median TTFT (ms):                        853.59
P99 TTFT (ms):                           855.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.43
Median TPOT (ms):                        77.61
P99 TPOT (ms):                           107.98
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.09
Median ITL (ms):                         57.67
P99 ITL (ms):                            117.57
==================================================

When you run that one, don’t use --enforce-eager, or the performance will be very bad. With CUDA graphs you will still be able to fit about 50K context.

### 🎉 MiniMax-M2-NVFP4

No matter what parameter combos or hacks I throw at it, the model still spits out pure nonsense.

Here’s the most aggressive config I tried:

# One-time patch to disable FlashInfer autotune:
docker exec vllm_node bash -c "
sed -i '1s/^/import os\n/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
sed -i 's/if has_flashinfer() and current_platform.has_device_capability(90):/skip_autotune = os.environ.get(\"VLLM_SKIP_FLASHINFER_AUTOTUNE\", \"0\") == \"1\"\n    if has_flashinfer() and current_platform.has_device_capability(90) and not skip_autotune:/' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py
"

# Launch WITHOUT --enforce-eager (CUDA graphs enabled!):
docker exec vllm_node bash -c "
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export HF_HUB_OFFLINE=1
export VLLM_SKIP_FLASHINFER_AUTOTUNE=1

vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name minimax \
  --trust-remote-code \
  --gpu-memory-utilization 0.75 \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  -tp 2 --distributed-executor-backend ray \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser minimax_m2 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --dtype auto --kv-cache-dtype fp8 \
  > /tmp/vllm.log 2>&1" &

Me:
Hello!

Model:

Thoughts

### 
** 
**
.  

.  

**!

 **

**   !**

Speed: ~25+ tok/s with CUDA graphs enabled - but the output is nonsense :)
Same with

  --enforce-eager \

and with minimal (default) params.

NVFP4 support is not fully functional on sm121 yet, just use AWQ quant: QuantTrio/MiniMax-M2-AWQ
You’ll get 40 t/s on dual sparks.

This is how I launch it in my container:

vllm serve QuantTrio/MiniMax-M2-AWQ --port 8888 --host 0.0.0.0 \
  --gpu-memory-utilization 0.7 \
  -tp 2 \
  --distributed-executor-backend ray \
  --max-model-len 128000 \
  --load-format fastsafetensors \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think

As for expert parallel, it won’t bring any performance advantage on a dual-Spark cluster. You may get slightly higher total throughput, but single-request inference will be slower. It was designed to work together with data parallel on clusters with multiple GPUs per node. On my system it also results in uneven memory utilization for some reason.
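
For illustration, the kind of layout expert parallel targets looks more like this - a hypothetical 2-node, 4-GPU-per-node cluster with made-up sizes, not something that maps onto two single-GPU Sparks:

# hypothetical 8-GPU cluster: 2 replicas x 4-way tensor parallel, experts sharded
vllm serve <model> \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --distributed-executor-backend ray

There, the MoE experts get spread across many GPUs while data parallel keeps them all busy with independent requests; with one GPU per node that split buys you nothing extra.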

Hey @eugr!

The MiniMax-M2-AWQ model is absolutely amazing, so much attention to detail and prompt/context following. I’ve already used it for a couple of projects and it works great! However, I’m struggling to match the 40 t/s single-prompt throughput you mentioned.

I rebuilt your Docker image from the latest eugr/spark-vllm-docker (commit e0f6cff, Dec 18) and I’m getting ~26 tok/s for a single prompt.

### 1 Prompt Benchmark:

============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  4.56
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.22
Output token throughput (tok/s):         26.10
Peak output token throughput (tok/s):    27.00
---------------Time to First Token----------------
Mean TTFT (ms):                          169.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.21
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.21
==================================================

### 10 Prompts Benchmark:

============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  56.06
Total input tokens:                      1354
Total generated tokens:                  2665
Request throughput (req/s):              0.18
Output token throughput (tok/s):         47.54
Peak output token throughput (tok/s):    80.00
---------------Time to First Token----------------
Mean TTFT (ms):                          488.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          116.26
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.74
==================================================

Here’s my launch command:

docker exec -d vllm_node bash -i -c "vllm serve QuantTrio/MiniMax-M2-AWQ \\
  --port 8000 --host 0.0.0.0 \\
  --served-model-name minimax \\
  --gpu-memory-utilization 0.7 \\
  -tp 2 \\
  --distributed-executor-backend ray \\
  --max-model-len 128000 \\
  --load-format fastsafetensors \\
  --trust-remote-code \\
  --enable-auto-tool-choice \\
  --tool-call-parser minimax_m2 \\
  --reasoning-parser minimax_m2_append_think"

I’m using bash -i -c to ensure NCCL variables from run-cluster-node.sh are loaded. Verified that NCCL_IB_HCA, NCCL_SOCKET_IFNAME are set correctly.
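
For reference, the check was just listing the variables inside the running container, e.g.:

docker exec vllm_node bash -i -c 'env | grep -E "NCCL_(IB_HCA|SOCKET_IFNAME|IB_DISABLE)"'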

Any ideas what might be causing the ~35% performance gap? Are there any additional parameters or settings you’re using?

Thanks for your awesome Docker setup btw - we all owe you a few packs of coffee!

Is your Ray cluster active and using Infiniband? Something is definitely not right here. Can you post the vLLM launch logs?

You can also try my new launch-cluster.sh script. Then you can launch MiniMax M2 like this (run it on your desired head node): it will start the cluster on all nodes, autoconfigure interfaces, and launch vLLM with the model:

./launch-cluster.sh --nccl-debug  \
        exec vllm serve QuantTrio/MiniMax-M2-AWQ \
        --port 8000 --host 0.0.0.0 \
        --gpu-memory-utilization 0.7 \
        -tp 2 --distributed-executor-backend ray \
        --max-model-len 128000 --load-format fastsafetensors \
        --enable-auto-tool-choice --tool-call-parser minimax_m2 \
        --reasoning-parser minimax_m2_append_think

You should see something like this:

Auto-detecting interfaces...
  Detected IB_IF: rocep1s0f1,roceP2p1s0f1. <--- MAKE SURE THESE ONES ARE THERE
  Detected ETH_IF: enp1s0f1np1
  Detected Local IP: 192.168.177.11 (192.168.177.11/24)
Auto-detecting nodes...
  Scanning for SSH peers on 192.168.177.11/24...
  Found peer: 192.168.177.12
  Cluster Nodes: 192.168.177.11,192.168.177.12
Head Node: 192.168.177.11
Worker Nodes: 192.168.177.12
Container Name: vllm_node
Action: exec
Checking SSH connectivity to worker nodes...
  SSH to 192.168.177.12: OK
Starting Head Node on 192.168.177.11...
779d2ffdee60ff0a6f821f9f9799f638182aa6d8839a5fc4b4ce79f1f71d6c18
Starting Worker Node on 192.168.177.12...
371dc5c5c9e159efd8a8741817f4ec50a9a02f4ba0a546056493fa2eb0b59444
Waiting for cluster to be ready...
Cluster head is responsive.
Executing command on head node: vllm serve QuantTrio/MiniMax-M2-AWQ --port 8888 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 128000 --load-format fastsafetensors --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think
(APIServer pid=1025) INFO 12-20 04:52:12 [api_server.py:1262] vLLM API server version 0.14.0rc1.dev16+g969bbc7c6.d20251219

[ .. skip .. ]

(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NCCL version 2.27.7+cuda13.0
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so.
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enp1s0f1np1:192.168.177.11<0>
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO Initialized NET plugin IB
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO Assigned NET plugin IB to comm [repeated 3x across cluster]
(EngineCore_DP0 pid=1151) (RayWorkerWrapper pid=1237) spark:1237:1237 [0] NCCL INFO Using network IB [repeated 3x across cluster]

[ .. skip .. ]

The most important lines in the log would have this info (I cut off the prefix, so it fits in the screen):

NCCL INFO NCCL version 2.27.7+cuda13.0
NCCL INFO NCCL_IB_DISABLE set by environment to 0.
NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enp1s0f1np1:192.168.177.11<0>
NCCL INFO Initialized NET plugin IB
NCCL INFO Assigned NET plugin IB to comm [repeated 3x across cluster]
NCCL INFO Using network IB [repeated 3x across cluster]

If you don’t see both RoCE interfaces in the “Using …” line and you don’t see “Using network IB”, but instead have a lot of references to “Socket” or “Using network Eth” or something like that, it means you’ll need to troubleshoot your InfiniBand settings.
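
If they are missing, a quick sanity check of the RDMA side on each node is a reasonable first step (assuming the rdma-core tools are present in the container or on the host):

# both RoCE devices should show state PORT_ACTIVE and link_layer Ethernet
ibv_devinfo | grep -E "hca_id|state|link_layer"

# and the backing network interfaces should be up
ip -br link show up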

I updated to the latest launch-cluster.sh and NCCL looks fine and uses InfiniBand:

NCCL INFO NCCL_IB_DISABLE set by environment to 0.
NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enP2p1s0f1np1:192.168.200.17<0>
NCCL INFO Initialized NET plugin IB
NCCL INFO Using network IB

All 16 channels show via NET/IB/0 and via NET/IB/1 ✅

However, the performance is still the same: ~26 tok/s for a single prompt.

Also seeing these warnings:

WARNING: SymmMemCommunicator: Device capability 12.1 not supported
WARNING: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
PyTorch supports (8.0) - (12.0)

Cluster Start command:

./launch-cluster.sh
–nodes 192.168.200.16,192.168.200.17
–eth-if enP2p1s0f1np1
–ib-if rocep1s0f1,roceP2p1s0f1
-t vllm-node-new
–nccl-debug
exec vllm serve QuantTrio/MiniMax-M2-AWQ
–port 8000 --host 0.0.0.0
–gpu-memory-utilization 0.7
-tp 2 --distributed-executor-backend ray
–max-model-len 128000 --load-format fastsafetensors
–enable-auto-tool-choice --tool-call-parser minimax_m2
–reasoning-parser minimax_m2_append_think 2>&1 | tee /tmp/vllm_full_log.txt | head -250

full log:

=== Launch ===
Detected Local IP: 192.168.200.16 (192.168.200.16/24)
Head Node: 192.168.200.16
Worker Nodes: 192.168.200.17
Container Name: vllm_node
Action: exec
Checking SSH connectivity to worker nodes…
SSH to 192.168.200.17: OK
Starting Head Node on 192.168.200.16…
3b330021c56562aaf28f0802820de4f48e6067c9b8f4077fe8ca2b9c8d915ee1
Starting Worker Node on 192.168.200.17…
b7026a0736aedde527f510f5776ecc8a03c4dc11184c8b437c57c6f50a7fb1b7
Waiting for cluster to be ready…
Cluster head is responsive.
Executing command on head node: vllm serve QuantTrio/MiniMax-M2-AWQ --port 8000 …

(APIServer pid=1024) INFO 12-20 05:17:10 vLLM API server version 0.13.0rc2.dev288

[ … model loading … ]

(RayWorkerWrapper pid=1235) NCCL INFO NCCL_SOCKET_IFNAME set by environment to enP2p1s0f1np1
(RayWorkerWrapper pid=1235) NCCL INFO Bootstrap: Using enP2p1s0f1np1:192.168.200.16<0>
(RayWorkerWrapper pid=1235) NCCL INFO NCCL version 2.27.7+cuda13.0
(RayWorkerWrapper pid=1235) NCCL INFO NCCL_IB_DISABLE set by environment to 0.
(RayWorkerWrapper pid=1235) NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
(RayWorkerWrapper pid=1235) NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enP2p1s0f1np1:192.168.200.16<0>
(RayWorkerWrapper pid=1235) NCCL INFO Initialized NET plugin IB
(RayWorkerWrapper pid=1235) NCCL INFO Assigned NET plugin IB to comm
(RayWorkerWrapper pid=1235) NCCL INFO Using network IB

(RayWorkerWrapper pid=1235) Channel 00/0 : 0[0] → 1[0] [send] via NET/IB/0
(RayWorkerWrapper pid=1235) Channel 01/0 : 0[0] → 1[0] [send] via NET/IB/1
[ … all 16 channels via NET/IB/0 and NET/IB/1 … ]

(RayWorkerWrapper) WARNING: SymmMemCommunicator: Device capability 12.1 not supported
(RayWorkerWrapper) WARNING: Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper) WARNING: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)

I noticed you’re running vLLM 0.14.0rc1 while I have 0.13.0rc2. Could that be the difference?

No. You can try to rebuild the container to see if there is any difference, but I was getting 40 t/s from the very beginning.

Can you post your full vllm bench serve command and the new results?

One thing I noticed is that you’ve assigned the IP to enP2p1s0f1np1? NVIDIA recommends using its “primary” twin - enp1s0f1np1. I wonder whether that makes a difference. It shouldn’t, but just in case.

Another possibility is that there are some other resource-consuming processes running on one of the nodes.

letsrock85@gx10-4e07:~$ docker exec vllm_node vllm bench serve --backend vllm --model QuantTrio/MiniMax-M2-AWQ --endpoint /v1/completions --dataset-name sharegpt --dataset-path /tmp/ShareGPT_V3.json --num-prompts 1 --port 8000 --host localhost 2>&1 | tail -20

Total input tokens:                      12        
Total generated tokens:                  119       
Request throughput (req/s):              0.32      
Output token throughput (tok/s):         37.63     
Peak output token throughput (tok/s):    39.00     
Peak concurrent requests:                1.00      
Total token throughput (tok/s):          41.42     
---------------Time to First Token----------------
Mean TTFT (ms):                          107.63    
Median TTFT (ms):                        107.63    
P99 TTFT (ms):                           107.63    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.89     
Median TPOT (ms):                        25.89     
P99 TPOT (ms):                           25.89     
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.89     
Median ITL (ms):                         25.90     
P99 ITL (ms):                            28.17     
==================================================

Getting better now - it was just the Moonlight/Sunshine streaming thing. My bad: I forgot the stream was coming from the worker Spark.

BTW, do you happen to have the MSI OEM DGX Spark? People say the MSI version runs about 10% faster because the cooling is beefier.

This is my yesterday result:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  3.07
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.33
Output token throughput (tok/s):         38.77
Peak output token throughput (tok/s):    40.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          42.68
---------------Time to First Token----------------
Mean TTFT (ms):                          110.34
Median TTFT (ms):                        110.34
P99 TTFT (ms):                           110.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.07
Median TPOT (ms):                        25.07
P99 TPOT (ms):                           25.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.07
Median ITL (ms):                         24.85
P99 ITL (ms):                            27.24
==================================================

I guess you are getting pretty much the same performance now. I’d make sure there is no other unnecessary traffic between the two nodes via the ConnectX interface other than what’s related to vLLM. Use the “normal” Ethernet interface for everything else (other than copying images from one Spark to another or similar large transfers).
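
A simple way to check is to watch the byte counters on the ConnectX interface while vLLM is idle (interface name as in your logs; they should barely move):

IF=enP2p1s0f1np1
cat /sys/class/net/$IF/statistics/{rx_bytes,tx_bytes}; sleep 5; cat /sys/class/net/$IF/statistics/{rx_bytes,tx_bytes}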

No, I have Founders Edition ones. Haven’t had any issues with cooling so far, but I made sure their airflow is not obstructed - plenty of clearance around.