Has anyone been successful setting up a two-Spark cluster? If so, please share what instructions you followed to get it going. Thank you!
Please reference cluster instructions: Spark Clustering — DGX Spark User Guide
Let us know if you run into issues.
Yep, I tried that, but scripts like ./discover-sparks aren't there? I was able to get the SSH functionality working but was not successful getting NCCL to work by following the instructions. I was getting the error "Authorization required, but no authorization protocol specified".
The discover-sparks.sh script is available there.
What I am unclear about, beyond the playbook, is what I should expect from dual DGX Sparks.
Does RDMA over Converged Ethernet mean that I can just set up the two DGX Sparks as described and then "magically" something like LM Studio will see both Sparks? Or do I have to do more custom command-line Llama.cpp/TensorRT-LLM work to split a model across two units?
Not sure I understand; the script referenced by Spark Clustering — DGX Spark User Guide is available here: dgx-spark-playbooks/nvidia/connect-two-sparks/assets/discover-sparks at main · NVIDIA/dgx-spark-playbooks · GitHub
Thanks for the pointer to the discover-sparks.sh script. Any insights on my "Authorization required, but no authorization protocol specified" error? I was running:
# Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
# Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H 169.254.57.235:1,169.254.240.92:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf
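(For anyone following along: the "Up" interface and link-local IPs referenced in the comments above come from a step like the following on each node. This is a minimal sketch using standard Linux tools; the enp1s0f* names are just the ones this thread happens to use.)

# List all interfaces with link state; the ConnectX-7 ports show up here
# (enp1s0f0np0 / enp1s0f1np1 on the units in this thread).
ip -br link show

# List addresses; the cabled port should be UP and hold a 169.254.x.x
# link-local address after the playbook setup.
ip -br addr show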
RoCE lets computers transfer data directly between each other's memory without involving the CPU much; it's a protocol. You'll need the application layer (e.g., LM Studio) to support it and make use of the improved bandwidth between nodes, so that compute and memory on multiple nodes can be used seamlessly. TRT-LLM is an example of such support.
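To make that concrete, here is a rough sketch (not from the playbook) of what a multi-node launch can look like with vLLM, which also supports this and comes up later in the thread. The model name, IPs, and interface name are placeholders, not tested commands:

# On node 1 (169.254.57.235 in this thread): start the Ray head process.
ray start --head --port=6379

# On node 2: join the Ray cluster started on node 1.
ray start --address=169.254.57.235:6379

# Back on node 1: serve a model split across both GPUs with tensor parallelism.
# NCCL carries the inter-node traffic over the RoCE link configured earlier.
# $MODEL_NAME is a placeholder for whatever model you want to serve.
NCCL_SOCKET_IFNAME=enp1s0f1np1 vllm serve "$MODEL_NAME" --tensor-parallel-size 2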
Please share the runtime log.
Sure, here you go. Note: I was able to complete the connect-two-sparks playbook successfully:
export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np10
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
# Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H 1169.254.240.92/16:1 169.254.57.235:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
ssh: connect to host 0.0.4.145 port 22: Connection timed out
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
May have a typo?
mpirun -np 2 -H 1169.254.240.92/16:1
1169?
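For reference, the host list should be the plain link-local IPs, comma-separated and without the /16 suffix, along the lines of the command from the guide quoted earlier:

mpirun -np 2 -H 169.254.240.92:1,169.254.57.235:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf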
Thanks. So what I am hearing is that DGX Spark clustering is just normal 200GbE clustering, as opposed to some sort of fancy Thunderbolt/OCuLink eGPU-style connection where the host OS just sees a second GPU thanks to software trickery. Instead, it's just normal NCCL that moves data directly to GPU memory (which doesn't matter much for GB10 with unified memory, but would matter for other GPUs).
So, I could also take a desktop PC running Windows and WSL Ubuntu with an RTX Pro 6000 Blackwell, and then put in:
MCX753436MC-HEAB
And then connect the PC to the Spark via
MCP1650-H001E30
and then, using TensorRT-LLM or vLLM, I would have the ability to run a "small" model at ultra-fast speeds entirely on the GB202 RTX 5090 or RTX Pro 6000. But when I exceed 32/96 GB, instead of offloading to the CPU, I would offload to the Spark, essentially following the playbook, since it's still the same ConnectX-7 setup?
Then I could run lower context windows at maximum speed when the model fits in 96 GB, but have the flexibility to add another 128 GB of slower compute via the DGX Spark (which is still much faster than offloading to the CPU)?
Since the DGX Spark has two 200GbE ports, if I use two cables, does bandwidth automatically increase to 400GbE?
I got the cluster working after a lot of diagnostic work; all NCCL tests are running successfully.