Has anyone been successful setting up a two-Spark cluster? If so, please share what instructions you followed to get it going. Thank you!
Please reference cluster instructions: Spark Clustering — DGX Spark User Guide
Let us know if you run into issues.
Yep, I tried that, but scripts like ./discover-sparks aren't there? I was able to get the SSH functionality working but was not successful getting NCCL to work by following the instructions. I was getting the error "Authorization required, but no authorization protocol specified".
The discover-sparks.sh script is available there.
What I am unclear about, beyond the playbook, is what I should expect from dual DGX Sparks.
Does RDMA over Converged Ethernet mean that I can just set up the two DGX Sparks as described and then "magically" something like LM Studio will see both Sparks? Or do I have to do more custom command-line Llama.cpp/TensorRT-LLM work to split a model across two units?
Not sure I understand; the script referenced by Spark Clustering — DGX Spark User Guide is available here: dgx-spark-playbooks/nvidia/connect-two-sparks/assets/discover-sparks at main · NVIDIA/dgx-spark-playbooks · GitHub
Thanks for the pointer to the discover-sparks.sh script. Any insights on my "Authorization required, but no authorization protocol specified" error? I was running:
# Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
# Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H 169.254.57.235:1,169.254.240.92:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf
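(For anyone following along: the "Up" interface and link-local IPs referenced in the comments above come from a step like the following on each node. This is a minimal sketch using standard Linux tools; the enp1s0f* names are just the ones this thread happens to use.)

# List all interfaces with link state; the ConnectX-7 ports show up here
# (enp1s0f0np0 / enp1s0f1np1 on the units in this thread).
ip -br link show

# List addresses; the cabled port should be UP and hold a 169.254.x.x
# link-local address after the playbook setup.
ip -br addr show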
RoCE lets computers transfer data directly between each other's memory without involving the CPU much; it's a protocol. You'll need the application layer (e.g., LM Studio) to support it and make use of the improved bandwidth between nodes, so that compute and memory on multiple nodes can be used seamlessly. TRT-LLM is an example of such support.
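To make that concrete, here is a rough sketch (not from the playbook) of what a multi-node launch can look like with vLLM, which also supports this and comes up later in the thread. The model name, IPs, and interface name are placeholders, not tested commands:

# On node 1 (169.254.57.235 in this thread): start the Ray head process.
ray start --head --port=6379

# On node 2: join the Ray cluster started on node 1.
ray start --address=169.254.57.235:6379

# Back on node 1: serve a model split across both GPUs with tensor parallelism.
# NCCL carries the inter-node traffic over the RoCE link configured earlier.
# $MODEL_NAME is a placeholder for whatever model you want to serve.
NCCL_SOCKET_IFNAME=enp1s0f1np1 vllm serve "$MODEL_NAME" --tensor-parallel-size 2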
Please share the runtime log.
Sure, here you go. Note: I was able to complete the connect-two-sparks playbook successfully:
export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np10
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
# Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H 1169.254.240.92/16:1 169.254.57.235:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
ssh: connect to host 0.0.4.145 port 22: Connection timed out
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
May have a typo?
mpirun -np 2 -H 1169.254.240.92/16:1
1169?
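For reference, the host list should be the plain link-local IPs, comma-separated and without the /16 suffix, along the lines of the command from the guide quoted earlier:

mpirun -np 2 -H 169.254.240.92:1,169.254.57.235:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf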
Thanks. So what I am hearing is that DGX Spark clustering is just normal 200GbE clustering, as opposed to some sort of fancy Thunderbolt/OCuLink eGPU-style connection where the host OS just sees a second GPU thanks to software trickery. Instead, it's just normal NCCL that moves data directly to GPU memory (which doesn't matter much for GB10 with unified memory, but would matter for other GPUs).
So, I could also take a desktop PC running Windows and WSL Ubuntu with an RTX Pro 6000 Blackwell, and then put in:
MCX753436MC-HEAB
And then connect the PC to the Spark via
MCP1650-H001E30
and then, using TensorRT-LLM or vLLM, I would have the ability to run a "small" model at ultra-fast speeds entirely on the GB202 RTX 5090 or RTX Pro 6000. But when I exceed 32/96 GB, instead of offloading to the CPU, I would offload to the Spark, essentially following the playbook, since it's still the same ConnectX-7 setup?
Then I could run lower context windows at maximum speed when the model fits in 96 GB, but have the flexibility to add another 128 GB of slower compute via the DGX Spark (which is still much faster than offloading to the CPU)?
Since the DGX Spark has two 200GbE ports, if I use two cables, does bandwidth automatically increase to 400GbE?
I got the cluster working after a lot of diagnostic work; all NCCL tests are running successfully.