Two-Spark cluster with vLLM using tensor-parallel-size 2 causes one node to drop while the other's GPU goes 100% forever

I’ve got a two-Spark cluster running okay with a very fast connection (24.38 GB/s at MTU 9000), and I’m trying to run vLLM in tensor-parallel mode with gpt-oss-20b. This setup works fine with --pipeline-parallel-size 2, but that doesn’t use both GPUs to 100%. Tensor parallelism does drive both GPUs to 100%, but after the model processes the first prompt, one node drops out while the other spins its GPU at 100% indefinitely until I kill vLLM. Sometimes it just stops in the middle of processing the first prompt and hangs. Here are my vLLM launch args:

vllm serve $MODEL_PATH \
    --distributed-executor-backend ray \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --load-format auto \
    --gpu-memory-utilization 0.7 \
    --attention-config.backend TRITON_ATTN \
    --disable-log-stats \
    --dtype auto \
    --trust-remote-code \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --served-model-name gpt-oss-20b

I launch Ray using the recommended method, nothing fancy, setting all the env vars to the appropriate interface name:

export IF_NAME=enp1s0f1np1
export NCCL_SOCKET_IFNAME=$IF_NAME
export OMPI_MCA_btl_tcp_if_include=$IF_NAME
export UCX_NET_DEVICES=$IF_NAME
export TP_SOCKET_IFNAME=$IF_NAME
export GLOO_SOCKET_IFNAME=$IF_NAME
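
The Ray bring-up itself is just the standard ray start on each node, something like this (the head-node IP is a placeholder):

# on the head node, after exporting the variables above
ray start --head --port=6379

# on the worker node, pointing at the head node's IP on the fast interface
ray start --address=<head-ip>:6379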

Again, this works okay if I launch vllm with --tensor-parallel-size 1 --pipeline-parallel-size 2.

Any ideas?

Just use our community Docker build - it is battle tested and optimized for two Sparks: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

As a bonus, you can use fastsafetensors as the load format, which will make model loading substantially quicker.

Also, drop --attention-config.backend TRITON_ATTN and let vLLM pick the backend automatically.
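
Assuming the fastsafetensors package is installed in your vLLM environment, your launch line would become something like:

# same command, minus the attention backend override, with the faster load format
vllm serve $MODEL_PATH \
    --distributed-executor-backend ray \
    --tensor-parallel-size 2 \
    --load-format fastsafetensors \
    --gpu-memory-utilization 0.7 \
    --disable-log-stats \
    --dtype auto \
    --trust-remote-code \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --served-model-name gpt-oss-20b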

Thanks. I’ll give it a shot. One more thing, I found that if I use --enforce-eager, this fixes the problem. Any idea why that would make a difference?

Broken CUDA graphs? I don’t know what vLLM version you are using, but there were some broken builds before.

I ended up figuring it out. I was compiling against torch 2.10.0 instead of 2.9.1 (doh!). Seems to be working fine now. Thanks for the helpful hint!
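
For anyone hitting the same thing, a quick way to check which torch (and vLLM) a given environment actually sees:

# print the torch and vLLM versions the environment is using
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"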

BTW, if you are still using your network config, you are killing your performance, as it will use TCP sockets instead of IB verbs/RDMA. You will end up with slower inference on the cluster than on a single node for anything other than big dense models, due to the huge latency penalty.

You need to specify some extra variables, e.g.:

export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export NCCL_DEBUG=INFO
export NCCL_IGNORE_CPU_AFFINITY=1

These two lines are key:

export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0

You need to specify both rocep1s0f1 and roceP2p1s0f1 because they are two halves of one physical NIC, each providing half of the possible bandwidth, even with a single cable.
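
If you are not sure what the HCA names are on your box, standard rdma-core/iproute2 tooling will list them (device names may differ on your system):

# list RDMA device names to plug into NCCL_IB_HCA
ibv_devices

# map each RDMA device to its network interface
rdma link show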

Thanks man, your help on this forum is gold! Is there a way to confirm the connection parameters are correct while vLLM is running? I can set the env vars, but I’d like to be 100% sure it’s all correct.

I just tested everything with GLM-4.7-Flash in NVFP4, and it’s working pretty well across both nodes. GPU use is ~90% on both, but they’re only drawing about 30W each. I usually get near 100W when running image or video generators on a single Spark. This suggests a bottleneck in the network or memory, which we are aware of, of course. Are you seeing a similar result with your setup?
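
(In case it’s useful, a simple way to watch utilization and power draw per node is a plain nvidia-smi query; I’m assuming power.draw is reported on the Spark:)

# sample GPU utilization and power draw once per second
nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 1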

If you set export NCCL_DEBUG=INFO (from the list above), you can look at the startup log and check for lines like these:

spark2:283:283 [0] NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
spark2:283:283 [0] NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enp1s0f1np1:192.168.177.12<0>
spark2:283:283 [0] NCCL INFO Initialized NET plugin IB
spark2:283:283 [0] NCCL INFO Assigned NET plugin IB to comm [repeated 3x across cluster]
spark2:283:283 [0] NCCL INFO Using network IB [repeated 3x across cluster]

If you don’t see Using network IB, it is not using RDMA.

spark:1208:2368 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark:1208:2368 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/1

These two lines show you that it establishes channels across both halves of the physical connection.
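
If the log is noisy, something like this pulls out just the relevant lines (assuming you redirect the server output to a file such as vllm.log):

# confirm RDMA is in use and that both halves of the NIC carry channels
grep -E "Using network IB|via NET/IB|NCCL_IB_HCA" vllm.log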

Yes, this is normal during inference. Inference is memory bound, so the GPU is not utilized that much unless you run batched inference with many concurrent requests. The GPU is mostly utilized during the prefill stage.

GLM-4.7-Flash is not working very well currently; you’d get better performance AND better quality using MiniMax M2.1 AWQ across the cluster. Also, NVFP4 in vLLM still doesn’t bring any advantages on Spark; AWQ quants currently outperform it (with better quality too).

Thanks for the suggestion. Already grabbed it from cyankiwi/MiniMax-M2.1-AWQ-4bit · Hugging Face

I was able to get GLM-4.7 (full) running in FP4 by patching glm4_moe.py in vllm. Not surprisingly, performance isn’t that great. However, I noticed my GPU usage with that model is ~95% across both Sparks, drawing up to about 45W each. I’m using Open WebUI, and can’t get vision working. vllm just complains that it’s “not a multimodal model.”

You don’t need to patch anything to run GLM-4.7 (and again, I’d recommend AWQ quants). And it’s not a vision model. If you want a GLM model with vision, you need GLM-4.6V with Transformers 5. Just try my Docker build; it has support for everything :)

Ah, that explains it. Gemini sure keeps telling me it is. But I’ll believe real people this time :)

I’m using the FP4 version from here: Salyut1/GLM-4.7-NVFP4 · Hugging Face

vLLM errors out because, I think, some of the tensor names don’t match what Glm4MoeForCausalLM expects. I had to do this to get it to work: Salyut1/GLM-4.7-NVFP4 · Broken model config? This is apparently only a problem with the FP4 model. It does work after making the modification, but again, performance isn’t awesome.

Should I use QuantTrio/GLM-4.7-AWQ · Hugging Face instead?

My Docker configuration supports the Salyut1 version; it’s as easy as running the command below. That model has an issue that is unique to it, but my build includes runtime patches for it.

./launch-cluster.sh --apply-mod ./mods/fix-Salyut1-GLM-4.7-NVFP4 \
exec vllm serve Salyut1/GLM-4.7-NVFP4 \
        --attention-config.backend flashinfer \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 32000 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8000

However, QuantTrio/GLM-4.7-AWQ will run better and you won’t need any patches. Here is the command line (to run with your build, just use the vllm serve component):

./launch-cluster.sh \
        exec vllm serve QuantTrio/GLM-4.7-AWQ \
        --tool-call-parser glm47 \
        --reasoning-parser glm45 \
        --enable-auto-tool-choice \
        -tp 2 \
        --gpu-memory-utilization 0.88 \
        --max-model-len 128000 \
        --kv-cache-dtype fp8 \
        --distributed-executor-backend ray \
        --host 0.0.0.0 \
        --port 8888

Hello,

I found this article quite informative:

Are you selecting the RoCE version and then applying NCCL_IB_GID_INDEX? If so, how is it affecting performance? I don’t have two DGX Spark systems, but below is what I’m seeing on my ASUS Ascent GX10:

DEV     PORT    INDEX   GID                             IPv4            VER     DEV
---     ----    -----   ---                             ------------    ---     ---
rocep1s0f0      1       0       ---------------------------------------         v1      enp1s0f0np0
rocep1s0f0      1       1       ---------------------------------------         v2      enp1s0f0np0
rocep1s0f1      1       0       ---------------------------------------         v1      enp1s0f1np1
rocep1s0f1      1       1       ---------------------------------------         v2      enp1s0f1np1
roceP2p1s0f0    1       0       ---------------------------------------         v1      enP2p1s0f0np0
roceP2p1s0f0    1       1       ---------------------------------------         v2      enP2p1s0f0np0
roceP2p1s0f1    1       0       ---------------------------------------         v1      enP2p1s0f1np1
roceP2p1s0f1    1       1       ---------------------------------------         v2      enP2p1s0f1np1
n_gids_found=8
ibv_devinfo -v | grep -i roce
hca_id: rocep1s0f0
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2
hca_id: rocep1s0f1
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2
hca_id: roceP2p1s0f0
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2
hca_id: roceP2p1s0f1
                        GID[  0]:               ::, RoCE v1
                        GID[  1]:               ::, RoCE v2

Setting NCCL_IB_GID_INDEX causes all_gather_perf to fail on my setup. Running all_gather_perf normally saturates the connection at 24.38 GB/s with an MTU of 9000; that’s about 195 Gbit/s out of a 200 Gbit/s link.

In that article he forgets to add the second “half” of the RoCE interface to NCCL_IB_HCA. If you add it, you get performance that saturates 200G in NCCL tests.
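
For reference, all_gather_perf comes from nccl-tests; a typical two-node run looks something like this (hostnames are placeholders, the binaries are built with MPI support, and the NCCL variables are forwarded to the remote rank with -x):

# one rank per node, sweeping message sizes from 8 bytes to 4 GB
mpirun -np 2 -H spark1:1,spark2:1 \
    -x NCCL_IB_HCA -x NCCL_IB_DISABLE -x NCCL_SOCKET_IFNAME \
    ./build/all_gather_perf -b 8 -e 4G -f 2 -g 1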

Thanks. I tested both MiniMax-M2.1 and GLM-4.7 using those AWQ quants. Both ran fine, although GLM barely fit into memory. You’re right, I didn’t need the patch for the AWQ version. MiniMax runs faster, but I was still able to generate some decent HTML and Python code with both models.

What’s the word on proper NVFP4 support in vllm? What kind of speedup should we expect?

For inference (decode), it should match or be just slightly faster than AWQ; for prefill (prompt processing) you can expect significant gains, since that stage is GPU bound.

As for when? People are working on it independently, so hopefully soon.

Unfortunately I do not have access to a second Spark.

But if I did, I would try Kimi K2.5 and see if two of them are capable of running it locally.

Well, this is a 1T-parameter model whose native weights are already 4-bit: that’s roughly 500 GB of weights alone, well beyond the ~256 GB of unified memory two Sparks have between them. Active parameters are only 32B though, same as GLM-4.7, and that runs in 4-bit on two Sparks (slowly). So I dunno, maybe 4 or 5 Sparks in a cluster could run it.

With that figure, vLLM reports “Available KV cache memory: 12.83 GiB.” Does this suggest I can reduce --gpu-memory-utilization? Or should I keep the KV cache as high as possible for a larger context window? Code editing especially burns through context fast.