Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other's GPU goes 100% forever

I’ve got a two-Spark cluster running okay with a very fast connection (24.38 GB/s at MTU 9000), and I’m trying to run vLLM in tensor-parallel mode with gpt-oss-20b. This setup works fine with --pipeline-parallel-size 2, but that doesn’t use both GPUs to 100%. Tensor parallelism does drive both GPUs to 100%, but after the model processes the first prompt, one node drops out while the other spins its GPU at 100% indefinitely until I kill vLLM. Sometimes it just stops in the middle of processing the first prompt and hangs. Here are my vLLM launch args:

vllm serve $MODEL_PATH \
    --distributed-executor-backend ray \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --load-format auto \
    --gpu-memory-utilization 0.7 \
    --attention-config.backend TRITON_ATTN \
    --disable-log-stats \
    --dtype auto \
    --trust-remote-code \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --served-model-name gpt-oss-20b

I launch Ray using the recommended method, nothing fancy, setting all the env vars to the appropriate interface name:

export IF_NAME=enp1s0f1np1
export NCCL_SOCKET_IFNAME=$IF_NAME
export OMPI_MCA_btl_tcp_if_include=$IF_NAME
export UCX_NET_DEVICES=$IF_NAME
export TP_SOCKET_IFNAME=$IF_NAME
export GLOO_SOCKET_IFNAME=$IF_NAME
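
The Ray bring-up itself is just the standard ray start on each node, something like this (the head-node IP is a placeholder):

# on the head node, after exporting the variables above
ray start --head --port=6379

# on the worker node, pointing at the head node's IP on the fast interface
ray start --address=<head-ip>:6379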

Again, this works okay if I launch vllm with --tensor-parallel-size 1 --pipeline-parallel-size 2.

Any ideas?

Just use our community Docker build - it is battle tested and optimized for two Sparks: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

As a bonus, you can use fastsafetensors as the load format, which will make model loading substantially quicker.

Also, drop --attention-config.backend TRITON_ATTN and let vLLM pick the backend automatically.
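
Assuming the fastsafetensors package is installed in your vLLM environment, your launch line would become something like:

# same command, minus the attention backend override, with the faster load format
vllm serve $MODEL_PATH \
    --distributed-executor-backend ray \
    --tensor-parallel-size 2 \
    --load-format fastsafetensors \
    --gpu-memory-utilization 0.7 \
    --disable-log-stats \
    --dtype auto \
    --trust-remote-code \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --served-model-name gpt-oss-20b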

Thanks. I’ll give it a shot. One more thing, I found that if I use --enforce-eager, this fixes the problem. Any idea why that would make a difference?

Broken CUDA graphs? I don’t know what vLLM version you are using, but there were some broken builds before.

I ended up figuring it out. I was compiling against torch 2.10.0 instead of 2.9.1 (doh!). Seems to be working fine now. Thanks for the helpful hint!
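
For anyone hitting the same thing, a quick way to check which torch (and vLLM) a given environment actually sees:

# print the torch and vLLM versions the environment is using
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"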

BTW, if you are still using your network config, you are killing your performance, as it will use TCP sockets instead of IB verbs/RDMA. You will end up with slower inference on the cluster than on a single node for anything other than big dense models, due to the huge latency penalty.

You need to specify some extra variables, e.g.:

export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export NCCL_DEBUG=INFO
export NCCL_IGNORE_CPU_AFFINITY=1

These two lines are key:

export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0

You need to specify both rocep1s0f1 and roceP2p1s0f1 because they are two halves of one physical NIC, each providing half of the possible bandwidth, even with a single cable.
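
If you are not sure what the HCA names are on your box, standard rdma-core/iproute2 tooling will list them (device names may differ on your system):

# list RDMA device names to plug into NCCL_IB_HCA
ibv_devices

# map each RDMA device to its network interface
rdma link show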

Thanks man, your help on this forum is gold! Is there a way to confirm the connection parameters are correct while vLLM is running? I can set the env vars, but I’d like to be 100% sure it’s all correct.

I just tested everything with GLM-4.7-Flash in NVFP4, and it’s working pretty well across both nodes. GPU use is ~90% on both, but they’re only drawing about 30W each. I usually get near 100W when running image or video generators on a single Spark. This suggests a bottleneck in the network or memory, which we are aware of, of course. Are you seeing a similar result with your setup?
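
(In case it’s useful, a simple way to watch utilization and power draw per node is a plain nvidia-smi query; I’m assuming power.draw is reported on the Spark:)

# sample GPU utilization and power draw once per second
nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 1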

If you set export NCCL_DEBUG=INFO (from the list above), you can look at the startup log and check for lines like these:

spark2:283:283 [0] NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
spark2:283:283 [0] NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enp1s0f1np1:192.168.177.12<0>
spark2:283:283 [0] NCCL INFO Initialized NET plugin IB
spark2:283:283 [0] NCCL INFO Assigned NET plugin IB to comm [repeated 3x across cluster]
spark2:283:283 [0] NCCL INFO Using network IB [repeated 3x across cluster]

If you don’t see Using network IB, it is not using RDMA.

spark:1208:2368 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark:1208:2368 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/1

These two lines show you that it establishes channels across both halves of the physical connection.
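
If the log is noisy, something like this pulls out just the relevant lines (assuming you redirect the server output to a file such as vllm.log):

# confirm RDMA is in use and that both halves of the NIC carry channels
grep -E "Using network IB|via NET/IB|NCCL_IB_HCA" vllm.log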

Yes, this is normal during inference. Inference is memory bound, so the GPU is not utilized that much unless you run batched inference with many concurrent requests. The GPU is mostly utilized during the prefill stage.

GLM-4.7-Flash is not working very well currently; you’d get better performance AND better quality using MiniMax M2.1 AWQ across the cluster. Also, NVFP4 in vLLM still doesn’t bring any advantages on Spark; AWQ quants currently outperform it (with better quality too).

Thanks for the suggestion. Already grabbed it from cyankiwi/MiniMax-M2.1-AWQ-4bit · Hugging Face

I was able to get GLM-4.7 (full) running in FP4 by patching glm4_moe.py in vllm. Not surprisingly, performance isn’t that great. However, I noticed my GPU usage with that model is ~95% across both Sparks, drawing up to about 45W each. I’m using Open WebUI, and can’t get vision working. vllm just complains that it’s “not a multimodal model.”

You don’t need to patch anything to run GLM-4.7 (and again, I’d recommend AWQ quants). And it’s not a vision model. If you want a GLM model with vision, you need GLM-4.6V with Transformers 5. Just try my Docker build; it has support for everything :)

Ah, that explains it. Gemini sure keeps telling me it is. But I’ll believe real people this time :)

I’m using the FP4 version from here: Salyut1/GLM-4.7-NVFP4 · Hugging Face

vLLM errors out because, I think, some of the tensor names don’t match what Glm4MoeForCausalLM expects. I had to do this to get it to work: Salyut1/GLM-4.7-NVFP4 · Broken model config? This is apparently only a problem with the FP4 model. It does work after making the modification, but again, performance isn’t awesome.

Should I use QuantTrio/GLM-4.7-AWQ · Hugging Face instead?

My Docker configuration supports the Salyut1 version; it’s as easy as running the command below. That model has an issue that is unique to it, but my build includes runtime patches for it.

./launch-cluster.sh --apply-mod ./mods/fix-Salyut1-GLM-4.7-NVFP4 \
exec vllm serve Salyut1/GLM-4.7-NVFP4 \
        --attention-config.backend flashinfer \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000

However, QuantTrio/GLM-4.7-AWQ will run better and you won’t need any patches. Here is the command line (to run with your build, just use the vllm serve component):

./launch-cluster.sh \
        exec vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 128000 \
        --kv-cache-dtype fp8 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8888

Hello,

I found this article quite informative:

Are you selecting the RoCE version and then applying NCCL_IB_GID_INDEX? If so, how is it affecting performance? I don’t have two DGX Spark systems, but below is what I’m seeing on my ASUS Ascent GX10:

DEV     PORT    INDEX   GID                             IPv4            VER     DEV
---     ----    -----   ---                             ------------    ---     ---
rocep1s0f0      1       0       ---------------------------------------         v1      enp1s0f0np0
rocep1s0f0      1       1       ---------------------------------------         v2      enp1s0f0np0
rocep1s0f1      1       0       ---------------------------------------         v1      enp1s0f1np1
rocep1s0f1      1       1       ---------------------------------------         v2      enp1s0f1np1
roceP2p1s0f0    1       0       ---------------------------------------         v1      enP2p1s0f0np0
roceP2p1s0f0    1       1       ---------------------------------------         v2      enP2p1s0f0np0
roceP2p1s0f1    1       0       ---------------------------------------         v1      enP2p1s0f1np1
roceP2p1s0f1    1       1       ---------------------------------------         v2      enP2p1s0f1np1
n_gids_found=8
ibv_devinfo -v | grep -i roce
hca_id: rocep1s0f0
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2
hca_id: rocep1s0f1
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2
hca_id: roceP2p1s0f0
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2
hca_id: roceP2p1s0f1
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2

Setting NCCL_IB_GID_INDEX causes all_gather_perf to fail on my setup. Running all_gather_perf normally saturates the connection at 24.38 GB/s with an MTU of 9000; that’s about 195 Gbit/s out of a 200 Gbit/s link.

In that article he forgets to add the second “half” of the RoCE interface to NCCL_IB_HCA. If you add it, you get performance that saturates 200G in NCCL tests.
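
For reference, all_gather_perf comes from nccl-tests; a typical two-node run looks something like this (hostnames are placeholders, the binaries are built with MPI support, and the NCCL variables are forwarded to the remote rank with -x):

# one rank per node, sweeping message sizes from 8 bytes to 4 GB
mpirun -np 2 -H spark1:1,spark2:1 \
    -x NCCL_IB_HCA -x NCCL_IB_DISABLE -x NCCL_SOCKET_IFNAME \
    ./build/all_gather_perf -b 8 -e 4G -f 2 -g 1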

Thanks. I tested both MiniMax-M2.1 and GLM-4.7 using those AWQ quants. Both ran fine, although GLM barely fit into memory. You’re right, I didn’t need the patch for the AWQ version. MiniMax runs faster, but I was still able to generate some decent HTML and Python code with both models.

What’s the word on proper NVFP4 support in vllm? What kind of speedup should we expect?

For inference (decode), it should match or be just slightly faster than AWQ; for prefill (prompt processing) you can expect significant gains, since that stage is GPU bound.

As for when? People are working on it independently, so hopefully soon.

Unfortunately I do not have access to a second Spark.

But if I did, I would try Kimi K2.5 and see if two of them are capable of running it locally.

Well, this is a 1T-parameter model whose native weights are already 4-bit: that’s roughly 500 GB of weights alone, well beyond the ~256 GB of unified memory two Sparks have between them. Active parameters are only 32B though, same as GLM-4.7, and that runs in 4-bit on two Sparks (slowly). So I dunno, maybe 4 or 5 Sparks in a cluster could run it.

With that figure, vLLM reports “Available KV cache memory: 12.83 GiB.” Does this suggest I can reduce --gpu-memory-utilization? Or should I keep the KV cache as high as possible for a larger context window? Code editing especially burns through context fast.