vLLM on dual Sparks

Has anyone managed to get the example from Install and Use vLLM for Inference | DGX Spark to work on dual Sparks?

I have followed the tutorial, but when I try to load the 70B model it says there is not enough memory. It seems as if it can't see the worker Spark. Here is the output of ray list nodes; 10.168.1.128 is the IP address of the head node's 10 GbE port.

======== List: 2025-11-14 11:31:17.340771 ========
Stats:
Total: 2

Table:

 #  NODE_ID                                                    NODE_IP         IS_HEAD_NODE  STATE  NODE_NAME
 0  43eb5919fef04013b0f0eff0e04f31781c31073cf70aa693b0d29c15   10.168.1.128    True          ALIVE  10.168.1.128
    RESOURCES_TOTAL:  CPU: 20.0, GPU: 1.0, accelerator_type:GB10: 1.0, memory: 108.718 GiB,
                      node:10.168.1.128: 1.0, node:internal_head: 1.0, object_store_memory: 9.728 GiB
    LABELS:           ray.io/accelerator-type: GB10,
                      ray.io/node-id: 43eb5919fef04013b0f0eff0e04f31781c31073cf70aa693b0d29c15
 1  5a299ad3678429ac998ca9c1d7ea58527dd8e57ffc877fbd7d286e0d   192.168.100.11  False         ALIVE  192.168.100.11
    RESOURCES_TOTAL:  CPU: 20.0, GPU: 1.0, accelerator_type:GB10: 1.0, memory: 109.849 GiB,
                      node:192.168.100.11: 1.0, object_store_memory: 9.728 GiB
    LABELS:           ray.io/accelerator-type: GB10,
                      ray.io/node-id: 5a299ad3678429ac998ca9c1d7ea58527dd8e57ffc877fbd7d286e0d

I have managed to get the 8B model to work, but obviously that fits on a single device.
I am not sure why it lists the head node as 10.168.1.128 rather than 192.168.100.10.
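For anyone checking the same thing, a quick way to see which physical port each address is bound to (just a sketch; interface names vary between systems):

# Brief per-interface IPv4 listing; the 10 GbE port should show 10.168.1.128
# and the 200 Gb/s ConnectX link should show 192.168.100.10.
ip -br -4 addr show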

The problem seems to be that the head node was defaulting to the address of the 10 GbE port rather than the 200 Gb/s link.

To fix it, I kludged the run_cluster.sh file to explicitly specify the Ray head node IP address (in my case 192.168.100.10):

if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --node-ip-address=192.168.100.10 --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

I still have to figure out why the worker keeps re-downloading the model weights from Hugging Face each time it runs.
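My guess (not verified) is that the Hugging Face cache inside the container is not persisted or shared, so every run starts from an empty cache. A possible workaround, sketched below with a placeholder model id and the default cache paths: pre-download the weights into the host cache on each node and mount that cache into the container.

# Pre-fetch the weights once per node so later runs hit the local cache
# (model id is a placeholder; use whichever 70B checkpoint you are serving).
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct

# Extra docker arguments to reuse the host cache inside the container;
# append these wherever the cluster script builds its docker run command.
HF_CACHE_MOUNT="-v ${HOME}/.cache/huggingface:/root/.cache/huggingface"

With the cache mounted, vLLM should find the files already on disk and skip the download.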

This article was very useful: Connecting Two DGX Spark Systems via 200Gb/s RoCE Network for Multi-Node GPU Training | by Doran Gao | Oct, 2025 | Medium


Look into this thread. There's a ton of information, insights, and code about using vLLM and Ray on stacked Sparks, including some improvements on loading times:
