I followed the instructions to set up NCCL for two Sparks (NCCL for Two Sparks | DGX Spark), but it failed at Step 5.
Testing basic SSH connectivity without a password works.
dell@promaxgb10-0b76:~/nccl$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
make -C src build BUILDDIR=/home/dell/nccl/build
make[1]: Entering directory '/home/dell/nccl/src'
NVCC_GENCODE is -gencode=arch=compute_121,code=sm_121
make[2]: Entering directory '/home/dell/nccl/src/device'
NVCC_GENCODE is -gencode=arch=compute_121,code=sm_121
make[2]: Leaving directory '/home/dell/nccl/src/device'
make[1]: Leaving directory '/home/dell/nccl/src'
dell@promaxgb10-0b76:~$ cd nccl-tests
dell@promaxgb10-0b76:~/nccl-tests$ make MPI=1
make -C src build BUILDDIR=/home/dell/nccl-tests/build
make[1]: Entering directory '/home/dell/nccl-tests/src'
make[1]: Leaving directory '/home/dell/nccl-tests/src'
dell@promaxgb10-0b76:~/nccl-tests$ ssh 169.254.164.220 hostname
promaxgb10-0843
dell@promaxgb10-0b76:~/nccl-tests$ ssh 169.254.120.36 hostname
promaxgb10-0b76
dell@promaxgb10-0b76:~/nccl-tests$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
dell@promaxgb10-0b76:~/nccl-tests$
dell@promaxgb10-0b76:~/nccl$ # Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
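For picking the Up interface used in the exports above, a small filter over the ibdev2netdev output can help. This is only a sketch: the helper name up_interfaces is made up for illustration, and it assumes every line ends in (Up) or (Down) with the interface name just before it, as in the output shown earlier.

```shell
# Print the network interface names that ibdev2netdev reports as Up.
# Assumed line format: "rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)"
# up_interfaces is a hypothetical helper, not part of any NVIDIA tool.
up_interfaces() {
    # The last field is the link state, the field before it is the interface.
    awk '$NF == "(Up)" { print $(NF-1) }'
}

# Usage on a node where ibdev2netdev is installed:
#   ibdev2netdev | up_interfaces
```

On the ibdev2netdev output above, this prints enp1s0f1np1 and enP2p1s0f1np1.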
# Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H 169.254.164.220:1,169.254.120.36:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Warning: Permanently added '169.254.164.220' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: promaxgb10-0b76
PID: 47872
Message: connect() to 169.254.164.220:1024 failed
Error: Resource temporarily unavailable (11)
dell@promaxgb10-0b76:~/nccl$ mpirun -np 2 -H 169.254.164.220:1,169.254.120.36:1 hostname
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
promaxgb10-0b76
promaxgb10-0843
dell@promaxgb10-0b76:~/nccl$