Trouble connecting two DGX Sparks with NCCL

Hi everyone,

I am currently following the guide for setting up NCCL across two Sparks (ASUS Ascent GX10). I followed the instructions carefully but ran into the following issues.

On Spark 1, Open MPI warns that an attempted connection is being ignored:

mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH $HOME/nccl-tests/build/all_gather_perf

Warning: Permanently added '192.168.100.11' (ED25519) to the list of known hosts.

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

Local host: gx10-a424
PID: 111972

On Spark 2, the test fails because 2 GPUs are requested but only 1 is found:

mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH \
  -x UCX_NET_DEVICES=enp1s0f0np0 \
  -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -x CUDA_VISIBLE_DEVICES=0 \
  $HOME/nccl-tests/build/all_gather_perf -b 33554432 -e 33554432 -n 20 -w 1 -g 1 -n 1

Warning: Permanently added '192.168.100.10' (ED25519) to the list of known hosts.

# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 1 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
.. gx10-a424 pid 111700: Test failure common.cu:1216
Invalid number of GPUs: 2 requested but only 1 were found.
Please check the number of processes and GPUs per process.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[41441,1],1]
Exit code: 5
--------------------------------------------------------------------------

The issue does not seem to be consistent. I regularly clear and redo the setup. At one point, the NCCL test on Spark 1 seemed to succeed, but Spark 2 hung.

Does anyone have best practices for debugging this issue?
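
In case it helps narrow things down, here is a rough sketch of a sanity check I could run before the NCCL test. Both excerpts above show the same host name (gx10-a424), and I am assuming (not confirmed) that nccl-tests decides how many ranks share a machine by comparing host names, so duplicate names could explain the "2 requested but only 1 were found" message. The helper name check_unique_hosts is made up for illustration; it only reads host names from stdin and flags duplicates.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NCCL, nccl-tests, or Open MPI):
# reads one host name per line on stdin and fails if any name repeats.
check_unique_hosts() {
    local dupes
    # sort groups identical names together; uniq -d prints only repeated ones
    dupes=$(sort | uniq -d)
    if [ -n "$dupes" ]; then
        echo "duplicate hostname(s): $dupes" >&2
        return 1
    fi
    echo "all ranks report distinct hostnames"
}

# Intended use, launched from ONE node only, with the same hosts as above
# (add the --mca plm_rsh_agent option from the commands above if needed):
#   mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname | check_unique_hosts
```

If the two ranks print the same name, or both lines come from the same machine, that would suggest looking at the host names and at whether mpirun is being started on both nodes at once, before looking at NCCL itself.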