Nsight Compute with MPI: ‘No Kernels Were Profiled’ Warning and Hanging Issue

Hello,
I’m currently trying to profile the following NCCL example using Nsight Compute:

(I used “Example 2: One Device per Process or Thread”)

To do this, I followed the instructions provided on the Nsight Compute home page.

However, when I run the program using the method shown above, Nsight Compute fails to detect any kernels and displays the following message: “==WARNING== No kernels were profiled.”

To address this, I also attempted to profile only the rank 0 process using the scripts and instructions from the Nsight Compute documentation, replacing mpirun with mpiexec. Unfortunately, this approach resulted in the program hanging during execution.

Could you please help me figure out how to successfully profile the NCCL example using Nsight Compute?

The actual scripts I used are below.

[Fail : No kernels were profiled]

ncu --target-processes all \
     --section=SpeedOfLight_RooflineChart \
     -o ${NSYS_OUT} -f \
     mpiexec -n ${NUM_PROCS} ${EXEC_FILE}

[Fail : Hanging issue]

mpiexec -n ${NUM_PROCS} ./wrapper_ncu_nccl.sh ${EXEC_FILE}

[wrapper_ncu_nccl.sh]
# With MPICH, "OMPI_COMM_WORLD_SIZE" and "PMI_RANK" don't work
if [[ $SLURM_LOCALID == 0 ]]; then
   ncu -o report_${SLURM_LOCALID}  --target-processes all "$@"
else
   "$@"
fi

[Fail : Hanging issue]

mpiexec -n ${NUM_PROCS} \
    ncu --target-processes all \
        --section=SpeedOfLight_RooflineChart \
        -o ${NSYS_OUT}_${SLURM_LOCALID} -f \
        ${EXEC_FILE}

I also tried the “--replay-mode app-range” option.
I modified the code as follows:

  cudaProfilerStart();

  //initializing NCCL
  NCCLCHECK(ncclCommInitRank(&comm, nRanks, id, myRank));

  //communicating using NCCL
  NCCLCHECK(ncclAllReduce((const void*)sendbuff, (void*)recvbuff, size, ncclFloat, ncclSum,
        comm, s));

  //completing NCCL operation by synchronizing on the CUDA stream
  CUDACHECK(cudaStreamSynchronize(s));

  cudaProfilerStop();

But ncu returns:
==ERROR== Failed to create report.
==WARNING== No ranges were profiled.

Nsight Compute does not currently support profiling applications that launch mandatory concurrent kernels from different processes (i.e., kernels that must run concurrently in order to make forward progress). You would need to change your application to spawn all kernels belonging to the same NCCL AllReduce from a single process. As a workaround, you can use Nsight Systems to collect GPU performance metrics (and NCCL API call information), though with less detail than Nsight Compute provides.
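
For the Nsight Systems route, a per-rank wrapper along the following lines may be a reasonable starting point. This is only a sketch under a few assumptions: `nsys` is on the PATH, Slurm exports SLURM_LOCALID as in the wrapper earlier in the thread, and NCCL calls show up through NVTX ranges. The script name `wrapper_nsys_nccl.sh`, the `profile_rank` function, the `DRY_RUN` switch, and the `report_nsys_` prefix are all hypothetical names chosen for illustration. GPU metrics sampling is left out because its option name and permission requirements differ between nsys versions; check `nsys profile --help` on your system.

```shell
#!/bin/sh
# wrapper_nsys_nccl.sh (hypothetical name): run each rank under Nsight Systems.
# Assumption: Slurm exports SLURM_LOCALID; outside Slurm the rank falls back to 0.

profile_rank() {
    # Prepend the nsys command line to the application and its arguments.
    # --trace=cuda,nvtx records CUDA API calls, kernels, and NVTX ranges.
    # Each rank writes its own report; an existing report is overwritten.
    set -- nsys profile --trace=cuda,nvtx --force-overwrite=true \
        -o "report_nsys_${SLURM_LOCALID:-0}" "$@"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        # DRY_RUN is a hypothetical switch: print the command instead of running it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Profile only when an application was passed on the command line.
if [ "$#" -gt 0 ]; then
    profile_rank "$@"
fi
```

It would be launched the same way as the ncu wrapper, e.g. `mpiexec -n ${NUM_PROCS} ./wrapper_nsys_nccl.sh ${EXEC_FILE}`. Unlike the ncu wrapper, every rank is profiled here rather than only rank 0, since Nsight Systems traces the ranks while they run concurrently instead of replaying individual kernels.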

Thank you for your reply!