Hello,
I’m currently trying to profile the following NCCL example using Nsight Compute:
(I used “Example 2: One Device per Process or Thread”)
To do this, I followed the instructions on the Nsight Compute home page.
However, when I run the program using the method shown above, Nsight Compute fails to detect any kernels and displays the following message: “==WARNING== No kernels were profiled.”
To address this, I also attempted to profile only the rank 0 process using the scripts and instructions from the Nsight Compute documentation, replacing mpirun with mpiexec. Unfortunately, this approach resulted in the program hanging during execution.
Could you please help me figure out how to successfully profile the NCCL example using Nsight Compute?
The actual scripts I used are below.
[Fail : No kernels were profiled ]
ncu --target-processes all \
--section=SpeedOfLight_RooflineChart \
-o ${NSYS_OUT} -f \
mpiexec -n ${NUM_PROCS} ${EXEC_FILE}
[Fail : Hanging issue]
mpiexec -n ${NUM_PROCS} ./wrapper_ncu_nccl.sh ${EXEC_FILE}
[wrapper_ncu_nccl.sh]
# Using MPICH; "OMPI_COMM_WORLD_SIZE" and "PMI_RANK" don't work
# Profile only the process whose local ID is 0; run every other rank unprofiled.
if [[ $SLURM_LOCALID == 0 ]]; then
    ncu -o report_${SLURM_LOCALID} --target-processes all "$@"
else
    "$@"
fi
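One thing that might be worth ruling out with the wrapper above is whether each rank actually sees the variable it branches on. The following is only a diagnostic sketch, not a fix: `detect_local_rank` is a hypothetical helper name, and the list of variables it checks is an assumption that should be adjusted for the launcher in use.

```shell
# Hypothetical helper: print the first rank-like environment variable that is
# set and non-empty. The variable list below is an assumption, not a
# definitive set for every launcher.
detect_local_rank() {
    for name in SLURM_LOCALID PMI_RANK OMPI_COMM_WORLD_LOCAL_RANK; do
        # Read the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
            return 0
        fi
    done
    # None of the candidates is set in this process.
    echo "none"
    return 1
}
```

Calling it at the top of the wrapper (for example `detect_local_rank >&2`) would show, per rank, which variable is available before the `if` decides whether to start `ncu`.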
[Fail : Hanging issue]
mpiexec -n ${NUM_PROCS} \
ncu --target-processes all \
--section=SpeedOfLight_RooflineChart \
-o ${NSYS_OUT}_${SLURM_LOCALID} -f \
${EXEC_FILE}
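A detail that may matter in the command above: `${SLURM_LOCALID}` is written directly on the `mpiexec` line, so the launching shell expands it once, before any rank starts, and every rank would then receive the same `-o` name. The sketch below illustrates that timing difference with plain `sh -c` standing in for a per-rank launch. The values `0` (parent) and `1` (child) are made up for illustration, and nothing here invokes `ncu` or `mpiexec`.

```shell
# Parent (launching) shell: assumed value, for illustration only.
SLURM_LOCALID=0

# Double quotes: the parent expands the variable before the child starts.
early=$(SLURM_LOCALID=1 sh -c "echo report_${SLURM_LOCALID}")

# Single quotes: the child expands the variable using its own environment.
late=$(SLURM_LOCALID=1 sh -c 'echo report_${SLURM_LOCALID}')

echo "parent-expanded: $early"   # parent-expanded: report_0
echo "child-expanded:  $late"    # child-expanded:  report_1
```

If this turns out to be relevant, one option would be to build the report name inside a per-rank wrapper script. The Nsight Compute CLI documentation also describes a `%q{ENV_NAME}` file macro for the `-o` option, which might serve the same purpose. Neither has been verified against this setup, and neither is known to address the hang itself.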
Additionally, I tried the “--replay-mode app-range” option.
I modified the code as follows:
cudaProfilerStart();
//initializing NCCL
NCCLCHECK(ncclCommInitRank(&comm, nRanks, id, myRank));
//communicating using NCCL
NCCLCHECK(ncclAllReduce((const void*)sendbuff, (void*)recvbuff, size, ncclFloat, ncclSum,
comm, s));
//completing NCCL operation by synchronizing on the CUDA stream
CUDACHECK(cudaStreamSynchronize(s));
cudaProfilerStop();
But ncu returns:
==ERROR== Failed to create report.
==WARNING== No ranges were profiled.