How do I run vLLM inference on a DGX Spark system using two ConnectX-7 NICs?

Hi, I’m trying to run vLLM inference on two DGX Spark units connected with dual ConnectX-7 NICs, but I’m stuck even after following the official NVIDIA guides. I’d really appreciate some help understanding what I’m missing.


1) Remote SSH access between the two DGX Spark systems

I successfully established remote SSH between the two DGX Spark machines using the following guide:
https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks

Everything in this section worked perfectly.

Here is the NIC status:

user@spark-xxx1:~$ ibdev2netdev
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)

And SSH between the two systems:

user@spark-xxx1:~$ ssh <dgx spark 1> hostname
spark-xxx1
user@spark-xxx1:~$ ssh <dgx spark 2> hostname
spark-xxx2

Up to this point, everything worked smoothly.
However, the real problems start afterwards.


2) Ray cluster setup

I then followed this guide for building the Ray cluster:
https://build.nvidia.com/spark/vllm/stacked-sparks

The issue begins at Step 5 of the Ray cluster setup (Head node + Worker node).

  • Head node (Node 1): (runs fine)

user@spark-xxx1:~$ export MN_IF_NAME=enP2p1s0f1np1
user@spark-xxx1:~$ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --head ~/.cache/huggingface \
  -e VLLM_HOST_IP=192.168.100.10 \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=192.168.100.10

2025-11-17 00:36:31,337 INFO usage_lib.py:473 – Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster. See Usage Stats Collection — Ray 3.0.0.dev0 for more details.
2025-11-17 00:36:31,337 INFO scripts.py:913 – Local node IP: 172.xxxxxx1
2025-11-17 00:36:34,080 SUCC scripts.py:949 – --------------------
2025-11-17 00:36:34,080 SUCC scripts.py:950 – Ray runtime started.
2025-11-17 00:36:34,080 SUCC scripts.py:951 – --------------------

  • Worker node (Node 2): (cannot join the cluster properly)

user@spark-xxx2:~$ export MN_IF_NAME=enP2p1s0f1np1
user@spark-xxx2:~$ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface \
  -e VLLM_HOST_IP=192.168.100.11 \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=192.168.100.10

[2025-11-17 00:40:09,666 W 1 1] gcs_rpc_client.h:155: Failed to connect to GCS at address 192.168.100.10:6379 within 5 seconds.
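
A GCS timeout like the one above can come from the worker simply not reaching the head node's address and port over the chosen link. The following is a minimal sketch of a reachability check, assuming bash and coreutils `timeout` are available on the worker. The helper names `check_port` and `report_gcs` are hypothetical; the address and port are the ones from the log line.

```shell
# Return 0 if a TCP connection to host:port opens within 5 seconds.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are required.
check_port() {
    local host="$1" port="$2"
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Print a one-line verdict for the Ray GCS port (defaults to 6379).
report_gcs() {
    local host="$1" port="${2:-6379}"
    if check_port "$host" "$port"; then
        echo "reachable: ${host}:${port}"
    else
        echo "unreachable: ${host}:${port}"
    fi
}

# Example, run on the worker node:
#   report_gcs 192.168.100.10 6379
```

If this reports "unreachable" while the head node says the Ray runtime started, the head is probably listening on a different address than the one the worker was given.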

With help from GPT and other sources, I tried alternative ways to connect the worker node to the Ray cluster.
Sometimes the worker does join, but vLLM inference still does NOT share GPUs across the nodes.

This is where I get stuck.


3) vLLM inference

After fixing Step 5 (Ray cluster part), ray status correctly shows the combined GPU memory:

128 GB × 2 = 256 GB nominal (ray status reports roughly 218 GB)
So it looks like Ray is detecting both DGX Spark units.

However, from Step 8 (vLLM model inference) in the guide onward, nothing works as described.
The vLLM server either:

  • does not start on the worker,

  • or cannot access GPUs on the other node,

  • or hangs during distributed initialization.

I feel like I am missing something fundamental, but I am not familiar with distributed computing, so I can’t tell what the root cause is.

I thought I followed the guide very carefully, but I can’t get vLLM inference to utilize GPUs across both DGX Spark nodes.


PS — Important question

From the NVIDIA guide, it sounds like both DGX Spark systems (Head + Worker) should each run a vLLM serving command during distributed inference.

Is that correct?
Do I need to run the vLLM server on both nodes simultaneously?

If possible, could someone clarify how vLLM serving is supposed to work in a dual-node DGX Spark environment?


Thanks in advance. I really appreciate any help you can provide.

  1. When setting up the stacked sparks, did you use the automatic IP configuration method or the manual method?
  2. When testing connectivity between the Sparks, are you using the IP for the CX7 NICs?
  3. You only need to run the serve command on the head node.
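
To illustrate point 3, here is a minimal sketch of launching the server once, from inside the head node's container, after the worker has joined the Ray cluster. The wrapper name `serve_on_head` and the `DRY_RUN` switch are hypothetical, the model ID is only an example, and passing `--distributed-executor-backend ray` explicitly is an assumption rather than something the guide is confirmed to require.

```shell
# Launch vLLM once on the head node; Ray places the second
# tensor-parallel shard on the worker node's GPU.
# With DRY_RUN set, the command is printed instead of executed.
serve_on_head() {
    local model="$1" tp_size="$2"
    ${DRY_RUN:+echo} vllm serve "$model" \
        --tensor-parallel-size "$tp_size" \
        --distributed-executor-backend ray
}

# Example, inside the head node's container only:
#   serve_on_head meta-llama/Llama-3.3-70B-Instruct 2
```

Nothing is started by hand on the worker; it only needs to stay joined to the Ray cluster.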

First, thank you for responding.

  1. I used automatic IP allocation exactly as the guide instructed.

  2. I followed the guide, but its instruction to assign temporary IPs starting with 192… and ending in .10 and .11 did not work at all. I then tried the head node’s address (the automatically assigned IP from step 1), but, as in step 8, Ray still picks up the physical address of the head node. For example, automatic IP allocation gave an IP starting with 160, but the Ray cluster actually uses a physical address starting with 170. I have tried this multiple times, and since other users report the same problem, the run_cluster.sh script itself seems to have a fundamental flaw.

  3. The guide’s instructions were unclear, so I assumed both methods were required. I will attempt this again after resolving point 2.

Check this thread: Suggested cable to link two Sparks? - #33 by raphael.amorim

For step 4 of the vLLM playbook, in the command where you run the cluster, you must use the IP address assigned to the CX7 NIC. You can use ip a to find it; it is assigned to the enp1s0f1np1 or enp1s0f0np0 interface. If the command in step 4 does not bring up the head node at the right IP address, you may need to edit the run_cluster.sh file to add --node-ip-address=${HEAD_NODE_ADDRESS} to the RAY_START_CMD, where HEAD_NODE_ADDRESS is the address assigned to the head node’s CX7 NIC.
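
As a rough sketch of the lookup described above, the address on a CX7 interface can be pulled out of the one-line `ip` output and then handed to Ray. The helper names `first_ipv4` and `iface_ipv4` are hypothetical, and the interface name is only an example taken from this thread; use whichever CX7 interface is up on your system.

```shell
# Extract the first IPv4 address (without the /prefix length)
# from `ip -4 -o addr show` output supplied on stdin.
first_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Look up the IPv4 address currently assigned to one interface.
iface_ipv4() {
    ip -4 -o addr show dev "$1" | first_ipv4
}

# Example, on the head node:
#   HEAD_NODE_ADDRESS=$(iface_ipv4 enp1s0f1np1)
#   ray start --head --port=6379 --node-ip-address="${HEAD_NODE_ADDRESS}"
```

Pinning the address this way keeps Ray from advertising whichever other interface it happens to detect first.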

As I expected, the problem was with run_cluster.sh: it kept picking up a different IP address, so I assigned one explicitly, and then LLaMA 3.3 70B ran successfully. Strangely, though, even though ray status shows 2/2 GPUs are available, the used GPU memory is reported as 0B/218.57GiB. What does this mean?

I will look into this and reach back out.

Mark Ramsey’s playbook just works and takes care of most things in scripts.

Hi oosijj, ray status shows 218.67GiB of memory available instead of the full 256GB because Ray reserves part of the memory for its object store. That reservation can be changed with, e.g., --object-store-memory=78643200. Some memory will always stay allocated to the object store because of a minimum allowed value, but you can reclaim most of that 22GB gap (256GB - 234GB).
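
The unit arithmetic behind those figures can be checked with a short sketch. It assumes the GiB value is binary (2^30 bytes) and the 256GB figure is decimal (10^9 bytes). The helper names are hypothetical, and 218.57 is the value quoted from ray status earlier in the thread.

```shell
# Convert GiB (2^30 bytes) to GB (10^9 bytes), two decimal places.
gib_to_gb() {
    awk -v gib="$1" 'BEGIN { printf "%.2f\n", gib * 1024 * 1024 * 1024 / 1000000000 }'
}

# Convert a byte count to whole MiB (2^20 bytes).
bytes_to_mib() {
    awk -v b="$1" 'BEGIN { printf "%d\n", b / (1024 * 1024) }'
}

gib_to_gb 218.57        # prints 234.69
bytes_to_mib 78643200   # prints 75
```

So the reported figure is about 234GB in decimal units, and the example --object-store-memory value corresponds to 75 MiB.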

The vLLM run_cluster.sh script has been updated with the relevant fix.