After doing some CUDA-related programming, I wanted to delve into NCCL programming, but the literature seems few and far between. I am starting here but cannot even get started with code snippets:
Thanks, I will try. I actually asked an AI and it came up with pretty good examples. So I installed nccl-dev and the MPI libraries, and it compiles OK, but I am getting a runtime error. Will provide details shortly.
if [[ -f $FILENAME.out ]] ; then
    LD_LIBRARY_PATH=/usr/lib64/openmpi/lib/ OMPI_ALLOW_RUN_AS_ROOT=1 OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
    mpirun -np 4 ./$FILENAME.out 2>&1 | tee $LODIR/$FILENAME.run.log
fi
log:
/usr/bin/ld: /tmp/tmpxft_000003b4_00000000-5_ex-1-create-comm.o: in function `MPI::Intracomm::Intracomm()': ex-1-create-comm.cpp:(.text._ZN3MPI9IntracommC2Ev[_ZN3MPI9IntracommC5Ev]+0x14): undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /tmp/tmpxft_000003b4_00000000-5_ex-1-create-comm.o: in function `MPI::Intracomm::Intracomm(ompi_communicator_t*)': ex-1-create-comm.cpp:(.text._ZN3MPI9IntracommC2EP19ompi_communicator_t[_ZN3MPI9IntracommC5EP19ompi_communicator_t]+0x19): undefined reference to `MPI::Comm::Comm()'
/usr/bin/ld: /tmp/tmpxft_000003b4_00000000-5_ex-1-create-comm.o: in function `MPI::Op::Init(void (*)(void const*, void*, int, MPI::Datatype const&), bool)': ex-1-create-comm.cpp:(.text._ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb[_ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb]+0x24): undefined reference to `ompi_mpi_cxx_op_intercept'
/usr/bin/ld: /tmp/tmpxft_000003b4_00000000-5_ex-1-create-comm.o:(.rodata._ZTVN3MPI3WinE[_ZTVN3MPI3WinE]+0x48): undefined reference to `MPI::Win::Free()'
/usr/bin/ld: /tmp/tmpxft_000003b4_00000000-5_ex-1-create-comm.o:(.rodata._ZTVN3MPI8DatatypeE[_ZTVN3MPI8DatatypeE]+0x78): undefined reference to `MPI::Datatype::Free()'
collect2: error: ld returned 1 exit status
That's a link-time error. You're missing the (proper/correct) MPI C++ bindings. Yes, I can see you have -lmpi_cxx; it may be a library-ordering issue, or a problem with your specific MPI build. I won't be able to sort it out for you. You should make sure you are using a properly built MPI such as the one in the HPC SDK. AFAIK NCCL expects a CUDA-aware MPI, with all that that implies.
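One common way around those undefined references, assuming the code uses only MPI's C API (as the standard NCCL examples do), is to skip Open MPI's deprecated C++ bindings at compile time instead of trying to link them. A build sketch; the include and library paths here are illustrative and will differ per install:

```shell
# OMPI_SKIP_MPICXX tells Open MPI's mpi.h to omit the deprecated C++
# bindings, so nothing ends up referencing MPI::Comm and friends.
nvcc -DOMPI_SKIP_MPICXX ex-1-create-comm.cpp -o ex-1-create-comm.out \
     -I/usr/include/openmpi-x86_64 \
     -L/usr/lib64/openmpi/lib -lmpi -lnccl
```

Alternatively, `mpicxx --showme` prints the exact flags the local Open MPI was built with, which helps rule out library-ordering problems.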
Turns out I already got it working. Now, after defeating a few more runtime errors, I am getting what is truly either a CUDA or an MPI issue.
Here myRank is getting clobbered during the call to ncclCommInitRankConfig().
I instrumented the code with some debugging output and can see that myRank is set through:
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
Shortly after that, it prints correctly, 0 and 1 respectively on a 2-GPU system:
++ tee log/ex-1-create-comm.run.log
myRank: 1.
nRanks: 2.
localRank: 1.
cudaSetDevice… with localRank: 1
myRank: 0.
nRanks: 2.
localRank: 0.
cudaSetDevice… with localRank: 0
MPI_Bcast…
MPI_Bcast…
The strange thing is that up to ncclCommInitRankConfig there is no code that updates myRank, yet just before the call to this function one of the myRank values becomes 128: