Hi, I’m trying to run vLLM inference on two DGX Spark units connected with dual ConnectX-7 NICs, but I’m stuck even after following the official NVIDIA guides. I’d really appreciate some help understanding what I’m missing.
1) Remote SSH access between the two DGX Spark systems
I successfully established remote SSH between the two DGX Spark machines using the following guide:
https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks
Everything in this section worked perfectly.
Here is the NIC status:
user@spark-xxx1:~$ ibdev2netdev
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
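To make it easier to compare the two machines, here is a minimal sketch that pulls out only the interfaces reported as Up. The helper name `up_ifaces` is hypothetical, and it assumes each line has the `... ==> <netdev> (Up|Down)` shape shown above.

```shell
# Hypothetical helper: print the netdev names that ibdev2netdev marks as Up.
# Reads ibdev2netdev-style lines on stdin, for example:
#   rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
up_ifaces() {
    # The last field is the link state, the field before it is the netdev name.
    awk '$NF == "(Up)" { print $(NF-1) }'
}

# Usage on each Spark (not run here):
#   ibdev2netdev | up_ifaces
#   ip -4 addr show dev enP2p1s0f1np1   # confirm which IP sits on that interface
```

Running this on both nodes should show whether the interface exported as MN_IF_NAME is actually Up and carries the 192.168.100.x address on each side.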
And SSH between the two systems:
user@spark-xxx1:~$ ssh <dgx spark 1> hostname
spark-xxx1
user@spark-xxx1:~$ ssh <dgx spark 2> hostname
spark-xxx2
Up to this point, everything worked smoothly.
However, the real problems start afterwards.
2) Ray cluster setup
I followed the next guide for building the Ray cluster:
https://build.nvidia.com/spark/vllm/stacked-sparks
The issue begins at Step 5 of the Ray cluster setup (Head node + Worker node).
- Head node (Node 1): (runs fine)
user@spark-xxx1:~$ export MN_IF_NAME=enP2p1s0f1np1
user@spark-xxx1:~$ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --head ~/.cache/huggingface \
  -e VLLM_HOST_IP=192.168.100.10 \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=192.168.100.10
2025-11-17 00:36:31,337 INFO usage_lib.py:473 – Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster. See Usage Stats Collection — Ray 3.0.0.dev0 for more details.
2025-11-17 00:36:31,337 INFO scripts.py:913 – Local node IP: 172.xxxxxx1
2025-11-17 00:36:34,080 SUCC scripts.py:949 – --------------------
2025-11-17 00:36:34,080 SUCC scripts.py:950 – Ray runtime started.
2025-11-17 00:36:34,080 SUCC scripts.py:951 – --------------------
- Worker node (Node 2): (cannot join the cluster properly)
user@spark-xxx2:~$ export MN_IF_NAME=enP2p1s0f1np1
user@spark-xxx2:~$ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface \
  -e VLLM_HOST_IP=192.168.100.11 \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=192.168.100.10
[2025-11-17 00:40:09,666 W 1 1] gcs_rpc_client.h:155: Failed to connect to GCS at address 192.168.100.10:6379 within 5 seconds.
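The head log above reports a Local node IP in the 172 range, while the worker dials 192.168.100.10:6379, so one thing that may be worth checking is plain TCP reachability of the GCS port from the worker. The following is only a sketch: `can_reach` is a hypothetical helper, and it assumes bash (for its `/dev/tcp` pseudo-device) and the coreutils `timeout` command are available.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT opens in time.
# usage: can_reach HOST PORT [TIMEOUT_SECONDS]   (timeout defaults to 5 seconds)
can_reach() {
    # bash -c receives HOST as $0 and PORT as $1; /dev/tcp is a bash feature.
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage on the worker node (not run here):
#   can_reach 192.168.100.10 6379 && echo "GCS port reachable" || echo "GCS port NOT reachable"
```

If the port is not reachable on 192.168.100.10 but is reachable on the address the head printed as its Local node IP, that would suggest the head is listening on a different interface than the one the worker is using.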
With help from GPT and other sources, I tried several alternative ways to connect the worker node to the Ray cluster.
Sometimes I can make the worker join, but vLLM inference still does NOT share GPUs across nodes.
This is where I get stuck.
3) vLLM inference
After fixing Step 5 (the Ray cluster part), ray status correctly shows the combined GPU memory:
128 GB × 2 = 256 GB nominal (approximately 218 GB reported)
So it looks like Ray is detecting both DGX Spark units.
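As a quick cross-check that Ray sees one GPU per Spark, the GPU total can be read out of the `ray status` output. This is a sketch under an assumption: that the Resources section contains a line shaped like ` 0.0/2.0 GPU` (used/total). The helper name `total_gpus` is hypothetical.

```shell
# Hypothetical helper: print the total GPU count from `ray status` output on stdin.
# ASSUMPTION: the output contains a line shaped like " 0.0/2.0 GPU" (used/total).
total_gpus() {
    # Split the "used/total" field on "/" and print the total part.
    awk '$2 == "GPU" { split($1, parts, "/"); print parts[2] }'
}

# Usage inside the Ray container on the head node (not run here):
#   ray status | total_gpus
```

With both nodes joined, a total of two GPUs would be the expected reading; a total of one would suggest the worker has not actually registered its GPU.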
However, from Step 8 (vLLM model inference) onward, nothing in the guide works as described.
The vLLM server either:
- does not start on the worker,
- or cannot access GPUs on the other node,
- or hangs during distributed initialization.
I feel like I am missing something fundamental, but I am not familiar with distributed computing, so I can’t tell what the root cause is.
I thought I followed the guide very carefully, but I can’t get vLLM inference to utilize GPUs across both DGX Spark nodes.
PS — Important question
From the NVIDIA guide, it sounds like both DGX Spark systems (Head + Worker) should each run a vLLM serving command during distributed inference.
Is that correct?
Do I need to run the vLLM server on both nodes simultaneously?
If possible, could someone clarify how vLLM serving is supposed to work in a dual-node DGX Spark environment?
Thanks in advance. I really appreciate any help you can provide.