Hi everyone,
I am currently following the guide for setting up NCCL across two Sparks (ASUS Ascent GX10). I followed the instructions carefully but ran into the following issues.
On Spark 1, Open MPI warns that an attempted connection is being ignored:
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH $HOME/nccl-tests/build/all_gather_perf
Warning: Permanently added '192.168.100.11' (ED25519) to the list of known hosts.
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
Local host: gx10-a424
PID: 111972
On Spark 2, the test fails because 2 GPUs were requested but only 1 was found:
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH \
-x UCX_NET_DEVICES=enp1s0f0np0 \
-x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
-x CUDA_VISIBLE_DEVICES=0 \
$HOME/nccl-tests/build/all_gather_perf -b 33554432 -e 33554432 -n 20 -w 1 -g 1 -n 1
Warning: Permanently added '192.168.100.10' (ED25519) to the list of known hosts.
# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 1 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
.. gx10-a424 pid 111700: Test failure common.cu:1216
Invalid number of GPUs: 2 requested but only 1 were found.
Please check the number of processes and GPUs per process.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[41441,1],1]
Exit code: 5
--------------------------------------------------------------------------
The issue does not seem to be consistent. I regularly clear and redo the setup. At one point, the NCCL test on Spark 1 seemed to succeed, but Spark 2 hung.
Does anyone have best practices for debugging this kind of issue?