Troubles with NCCL connecting 2 DGX Sparks

sonphamorg · April 4, 2026, 2:41am

Hi everyone,

I am currently following the guide to set up NCCL for two Sparks (ASUS Ascent GX10). I followed the instructions carefully but ran into the following issues.

On Spark 1, Open MPI warns that an attempted connection is being ignored:

mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH $HOME/nccl-tests/build/all_gather_perf

Warning: Permanently added '192.168.100.11' (ED25519) to the list of known hosts.

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

Local host: gx10-a424
PID: 111972

On Spark 2, the test fails because it requests 2 GPUs but finds only 1:

mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH \
  -x UCX_NET_DEVICES=enp1s0f0np0 \
  -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -x CUDA_VISIBLE_DEVICES=0 \
  $HOME/nccl-tests/build/all_gather_perf -b 33554432 -e 33554432 -n 20 -w 1 -g 1 -n 1

Warning: Permanently added '192.168.100.10' (ED25519) to the list of known hosts.
# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 1 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
.. gx10-a424 pid 111700: Test failure common.cu:1216
Invalid number of GPUs: 2 requested but only 1 were found.
Please check the number of processes and GPUs per process.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[41441,1],1]
Exit code: 5
--------------------------------------------------------------------------
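The "2 requested but only 1 were found" message may point at how ranks are grouped rather than at missing hardware. As far as I understand, nccl-tests treats ranks that report the same hostname as sharing one machine, so two Sparks with identical hostnames could look like a single node that needs 2 GPUs. Both logs above show gx10-a424, although that may simply be the same machine printing both. A minimal sketch to check this, assuming passwordless SSH to both IPs; `all_distinct` and `check_hostnames` are hypothetical helper names, not part of NCCL or Open MPI:

```shell
# Hypothetical helper: succeeds only if all arguments are distinct.
all_distinct() {
  total=$#
  # Count unique values; tr strips padding that some wc versions add.
  unique=$(printf '%s\n' "$@" | sort -u | wc -l | tr -d ' ')
  [ "$total" -eq "$unique" ]
}

# Hypothetical helper: collect the hostname each node reports over SSH.
check_hostnames() {
  names=""
  for ip in "$@"; do
    # ssh runs `hostname` on the remote node and prints the result.
    names="$names $(ssh -o ConnectTimeout=5 "$ip" hostname)"
  done
  # Intentionally unquoted: split the collected names into separate arguments.
  if all_distinct $names; then
    echo "hostnames are distinct:$names"
  else
    echo "duplicate hostnames:$names"
    return 1
  fi
}

# Usage (IPs taken from the mpirun command above):
#   check_hostnames 192.168.100.10 192.168.100.11
```

If the names turn out to be identical, giving each Spark its own hostname would be the first thing to try before rerunning the test.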

The issue is not consistent. I regularly clear and redo the setup. At one point, the NCCL test on Spark 1 seemed to succeed, but Spark 2 hung.

Does anyone have best practices for debugging this issue?
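One possible starting point, sketched under assumptions: launch mpirun from one Spark only (a single mpirun already starts a rank on each host, and two overlapping jobs started from both machines might produce stray connections like the warning above), pin Open MPI's own TCP traffic to the same interface given to NCCL, and turn on NCCL's debug log. `btl_tcp_if_include` is an Open MPI MCA parameter and `NCCL_DEBUG` is an NCCL environment variable; the function name `build_debug_cmd` is hypothetical, and whether enp1s0f0np0 is the right interface on both machines is an assumption carried over from the command above. The sketch only prints the command so it can be inspected first, and it leaves out the plm_rsh_agent option for brevity:

```shell
# Hypothetical helper: print (do not run) an mpirun command for a debug pass.
# Arguments: <interface> <host1> <host2>
build_debug_cmd() {
  ifname=$1
  host1=$2
  host2=$3
  # One rank per host, Open MPI TCP and NCCL sockets pinned to one interface.
  printf 'mpirun -np 2 -H %s:1,%s:1 ' "$host1" "$host2"
  printf -- '--mca btl_tcp_if_include %s ' "$ifname"
  printf -- '-x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=%s ' "$ifname"
  printf -- '-x NCCL_DEBUG=INFO '
  printf '%s -b 33554432 -e 33554432 -g 1\n' "$HOME/nccl-tests/build/all_gather_perf"
}

# Print the command, check it, then run it from ONE Spark only.
build_debug_cmd enp1s0f0np0 192.168.100.10 192.168.100.11
```

With `NCCL_DEBUG=INFO` set, NCCL's initialization log should show which interface and transport each rank selected, which may help tell an interface mismatch apart from a launch problem.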
