Nsight Compute with MPI: ‘No Kernels Were Profiled’ Warning and Hanging Issue

dhwlghksy · March 23, 2025, 10:42pm

Hello,
I’m currently trying to profile the following NCCL example using Nsight Compute:

(I used “Example 2: One Device per Process or Thread”)

To do this, I followed the instructions provided on the Nsight Compute home page.

However, when I run the program using the method shown above, Nsight Compute fails to detect any kernels and displays the following message, “==WARNING== No kernels were profiled.”

To address this, I also attempted to profile only the rank 0 process using the scripts and instructions from the Nsight Compute documentation, replacing mpirun with mpiexec. Unfortunately, this approach resulted in the program hanging during execution.

Could you please help me figure out how to successfully profile the NCCL example using Nsight Compute?

The actual scripts I used are below.

[Fail : No kernels were profiled ]

ncu --target-processes all \
     --section=SpeedOfLight_RooflineChart \
     -o ${NSYS_OUT} -f \
     mpiexec -n ${NUM_PROCS} ${EXEC_FILE}

[Fail : Hanging issue]

mpiexec -n ${NUM_PROCS} ./wrapper_ncu_nccl.sh ${EXEC_FILE}

[wrapper_ncu_nccl.sh]
# Using MPICH: "OMPI_COMM_WORLD_SIZE" and "PMI_RANK" don't work, so SLURM_LOCALID is used instead
if [[ $SLURM_LOCALID == 0 ]]; then
   ncu -o report_${SLURM_LOCALID}  --target-processes all "$@"
else
   "$@"
fi

[Fail : Hanging issue]

mpiexec -n ${NUM_PROCS} \
    ncu --target-processes all \
        --section=SpeedOfLight_RooflineChart \
        -o ${NSYS_OUT}_${SLURM_LOCALID} -f \
        ${EXEC_FILE}

Additionally, I tried the “--replay-mode app-range” option.
I modified the code as follows:

  cudaProfilerStart();

  //initializing NCCL
  NCCLCHECK(ncclCommInitRank(&comm, nRanks, id, myRank));

  //communicating using NCCL
  NCCLCHECK(ncclAllReduce((const void*)sendbuff, (void*)recvbuff, size, ncclFloat, ncclSum,
        comm, s));

  //completing NCCL operation by synchronizing on the CUDA stream
  CUDACHECK(cudaStreamSynchronize(s));

  cudaProfilerStop();

But ncu returns:
==ERROR== Failed to create report.
==WARNING== No ranges were profiled.

felix_dt · March 24, 2025, 11:26am

Nsight Compute does not currently support profiling applications that launch mandatory concurrent kernels from different processes (i.e., kernels that must run concurrently in order to make forward progress). You would need to change your application to spawn all kernels belonging to the same NCCL AllReduce from a single process. As a workaround, you can collect GPU performance metrics (and NCCL API call information) with Nsight Systems, which provides less detail.
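
For anyone trying the Nsight Systems workaround, here is a minimal sketch of a per-rank wrapper, not an official recipe. The script name nsys_wrapper.sh, the helper names rank_id and profile_rank, the DRY_RUN switch, and the report naming are all hypothetical. It assumes the launcher exports one of SLURM_LOCALID, PMI_RANK, or OMPI_COMM_WORLD_RANK, and that NCCL calls show up through NVTX ranges, which may depend on your NCCL and Nsight Systems versions. GPU metrics sampling options differ between releases, so check `nsys profile --help` before adding them.

```shell
#!/usr/bin/env bash
# nsys_wrapper.sh (hypothetical name): run each rank under Nsight Systems.
# Assumed launch line:
#   mpiexec -n ${NUM_PROCS} ./nsys_wrapper.sh ${EXEC_FILE}

# Print a per-rank id taken from the first launcher variable that is set.
# Assumption: your launcher exports one of these; otherwise 0 is used.
# Note that SLURM_LOCALID is node-local, so report names can collide
# across nodes on a multi-node run.
rank_id() {
    echo "${SLURM_LOCALID:-${PMI_RANK:-${OMPI_COMM_WORLD_RANK:-0}}}"
}

# Profile the given command with nsys, writing one report per rank.
# With DRY_RUN set to a non-empty value, the command is printed, not run.
profile_rank() {
    ${DRY_RUN:+echo} nsys profile \
        --trace=cuda,nvtx \
        -f true \
        -o "report_rank$(rank_id)" \
        "$@"
}

# Only profile when a command was actually passed to the wrapper.
if [ "$#" -gt 0 ]; then
    profile_rank "$@"
fi
```

Unlike the ncu wrapper in the question, every rank runs under the profiler here, since Nsight Systems traces the ranks without serializing their kernels. Running with DRY_RUN=1 first is a cheap way to check the generated command line before a real job.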

dhwlghksy · March 24, 2025, 5:53pm

Thank you for your reply!
