I’ve got a two-Spark cluster running okay with a very fast connection (24.38 GB/sec using MTU 9000), and I’m trying to run vLLM in tensor parallel mode with gpt-oss-20b. This setup works fine with the vllm option --pipeline-parallel-size 2, but that doesn’t use both GPUs to 100%. Tensor parallel mode does utilize both GPUs at 100%, but the problem is that after the first prompt is processed by the model, one node drops out while the other spins its GPU at 100% indefinitely until I kill vllm. Sometimes it just stops in the middle of processing the first prompt and hangs. Here are my vllm launch args:
BTW, if you are still using your network config, you are killing your performance, as it will use TCP sockets instead of IB verbs/RDMA. You will end up with slower inference on the cluster than on a single node for anything other than big dense models, due to the huge latency penalty.
You need to specify both rocep1s0f1 and roceP2p1s0f1 because they are two halves of one physical NIC, each providing half of the possible bandwidth, even with a single cable.
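In practice it’s just a couple of NCCL environment variables exported on each node before launching vllm. A minimal sketch using the interface names above (the OOB interface name is taken from the log further down in this thread; double-check yours with ibv_devinfo / ip addr):

export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1   # both halves of the physical NIC
export NCCL_SOCKET_IFNAME=enp1s0f1np1        # bootstrap/OOB interface, adjust to your host NIC
export NCCL_DEBUG=INFO                       # optional, lets you verify RDMA is actually used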
Thanks man, your help on this forum is gold! Is there a way I can test to confirm the connection’s parameters are correct while vllm is running? I mean, I can set the env vars, but I’d like to be 100% sure it’s all correct.
I just tested everything with GLM-4.7-Flash in NVFP4, and it’s working pretty well across both nodes. My GPU use is ~90% on both, but they’re only drawing about 30W each. Usually I’ll get near 100W when running image or video generators on a single Spark. This suggests a bottleneck in the network or memory, which we are aware of, of course. Are you seeing a similar result with your setup?
If you set this parameter - export NCCL_DEBUG=INFO - you can look at the startup log and check if you can see strings like this:
spark2:283:283 [0] NCCL INFO NCCL_IB_HCA set to rocep1s0f1,roceP2p1s0f1
spark2:283:283 [0] NCCL INFO NET/IB : Using [0]rocep1s0f1:1/RoCE [1]roceP2p1s0f1:1/RoCE [RO]; OOB enp1s0f1np1:192.168.177.12<0>
spark2:283:283 [0] NCCL INFO Initialized NET plugin IB
spark2:283:283 [0] NCCL INFO Assigned NET plugin IB to comm [repeated 3x across cluster]
spark2:283:283 [0] NCCL INFO Using network IB [repeated 3x across cluster]
If you don’t see Using network IB, it is not using RDMA.
spark:1208:2368 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0
spark:1208:2368 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/1
These two lines show that NCCL establishes channels across both halves of the physical connection.
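And if you want to double-check while vllm is actually serving rather than digging back through the startup output, something along these lines works (the log file name is just an example, and the counter path assumes the usual mlx5 sysfs layout):

NCCL_DEBUG=INFO vllm serve ... 2>&1 | tee vllm.log
grep -E "Using network IB|via NET/IB" vllm.log
# this counter should keep climbing during generation if RDMA is really carrying the traffic
cat /sys/class/infiniband/rocep1s0f1/ports/1/counters/port_xmit_data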
Yes, this is normal during inference. Inference is memory bound, so the GPU is not utilized that much unless you run batched inference with many concurrent requests. The GPU is mostly utilized during the prefill stage.
GLM 4.7 Flash is not working very well currently; you’d get better performance AND better quality using MiniMax M2.1 AWQ across the cluster. Also, NVFP4 in vLLM still doesn’t bring any advantages on Spark, so AWQ quants will outperform it for now (with better quality too).
I was able to get GLM-4.7 (full) running in FP4 by patching glm4_moe.py in vllm. Not surprisingly, performance isn’t that great. However, I noticed my GPU usage with that model is ~95% across both Sparks, drawing up to about 45W each. I’m using Open WebUI, and can’t get vision working. vllm just complains that it’s “not a multimodal model.”
You don’t need to patch anything to run GLM-4.7 (and again, I’d recommend AWQ quants). And it’s not a vision model. If you want a GLM model with vision, you need GLM-4.6V and Transformers 5. Just try my docker build, it has support for everything there :)
vllm errors out because some of the tensor names (I think) don’t match what Glm4MoeForCausalLM expects. I had to do this to get it to work: Salyut1/GLM-4.7-NVFP4 · Broken model config? This is apparently only a problem with the FP4 model. It does work after making the modification, but again, performance isn’t awesome.
My Docker configuration supports the Salyut1 version - it’s as easy as running this. There’s an issue that is unique to this particular model, but my build includes runtime patches for it.
However, QuantTrio/GLM-4.7-AWQ will run better, and you won’t need any patches. Here is the command line (to run it with your own build, just use the vllm serve part):
Are you selecting the RoCE version by setting NCCL_IB_GID_INDEX? If so, how does it affect performance? I don’t have two DGX Spark systems, but below is what I’m seeing on my ASUS Ascent GX10:
Setting NCCL_IB_GID_INDEX causes all_gather_perf to fail on my setup. Running all_gather_perf normally saturates the connection at 24.38 GB/sec with an MTU of 9000. That’s 195 Gbit/sec out of a 200 Gbit/sec connection.
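For reference, the test run is roughly this (hostnames and the nccl-tests path are placeholders for my setup; it assumes nccl-tests was built with MPI and the same NCCL env vars are exported on both nodes):

mpirun -np 2 -H spark1,spark2 \
  -x NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1 -x NCCL_DEBUG=INFO \
  ./nccl-tests/build/all_gather_perf -b 128M -e 8G -f 2 -g 1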
In this article he forgets to add the second “half” of the RoCE interface to NCCL_IB_HCA. If you add it, you get performance that saturates 200G in NCCL tests.
Thanks. I tested both MiniMax-M2.1 and GLM-4.7 using those AWQ quants. Both ran fine, although GLM barely fit into memory. You’re right, I didn’t need the patch for the AWQ version. MiniMax runs faster, but I was still able to generate some decent HTML and Python code with both models.
What’s the word on proper NVFP4 support in vllm? What kind of speedup should we expect?
For inference it should match or be just slightly faster than AWQ; for prefill (prompt processing) you can expect significant gains, since prefill is GPU bound.
As for when? People are working on it independently, so hopefully soon.
Well, this is a 1T parameter model whose native weights are already 4-bit, so roughly 500 GB of weights. That’s well beyond the 256 GB of unified memory two Sparks have between them. Active parameters are only 32B though, same as GLM-4.7, and that runs in 4-bit on two Sparks (slowly). So I dunno, maybe 4 or 5 Sparks in a cluster could run it.
When I use this figure, vllm reports “Available KV cache memory: 12.83 GiB.” Does this suggest I can reduce --gpu-memory-utilization? Or should I keep the KV cache as high as possible for a larger context window? Code editing especially burns through context fast.