Cannot profile in a Slurm environment


I would like to use Nsight Compute with Slurm in a multi-node environment. I am using an sbatch script, to which I added the following command:

srun ncu --target-processes all -o report_$OMPI_COMM_WORLD_RANK python

However, when execution reaches the send kernel (node0 -> node1), that kernel is not profiled. I don't see any error; the job just keeps running, but there is no result for that kernel. Is there a way to profile in a Slurm environment?
(Nsight Systems works when launched the same way.)

Thank you in advance.

From your description, it appears that the kernels in your application are communicating across process boundaries and require such communication in order to make forward progress. In your setup, multiple instances of Nsight Compute are launched, one per rank, but these instances won’t be communicating with each other. Since you’ve chosen the default set of metrics to collect, Nsight Compute will replay each kernel multiple times (multiple “passes”) in order to collect all the required metrics. Replaying kernels that require inter-process communication to complete won’t work, as the inter-process state won’t be restored between passes, and the passes won’t be synchronized across processes.

You can try several options, but all have limitations:

  • Collect selected GPU metric sampling with Nsight Systems, which doesn’t have the same replay requirements as Nsight Compute
  • Only collect single-pass metrics in Nsight Compute, e.g. gpc__cycles_elapsed.sum. You can use the tool’s section files or its --query-metrics option to find available metrics. You will need to try them to see whether they can be collected in a single pass on your GPU architecture.
  • If that doesn’t work either, collect single-pass metrics but only for a selected MPI rank by executing a wrapper script, rather than your application directly, e.g.

To profile a single rank one can use a wrapper script. The following script (called “”) profiles rank 0 only:

#!/bin/bash
# Rank 0 runs under ncu; every other rank runs the command unprofiled.
if [[ $OMPI_COMM_WORLD_RANK == 0 ]]; then
   ncu -o report_${OMPI_COMM_WORLD_RANK} --target-processes all "$@"
else
   "$@"
fi
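
For a job started with srun, the wrapper idea above could be sketched as follows. This is only a sketch under some assumptions: the file name ncu_rank0.sh and the helper names get_rank and run_rank are hypothetical, the fallback to SLURM_PROCID assumes that OMPI_COMM_WORLD_RANK may not be set when the tasks are started by srun rather than by Open MPI's own launcher, and gpc__cycles_elapsed.sum is used only because it is the metric mentioned above. Whether it is collected in a single pass still has to be checked on your GPU.

```shell
#!/bin/bash
# Hypothetical wrapper script, e.g. saved as ncu_rank0.sh (name is an assumption).

# Print the rank of this task. OMPI_COMM_WORLD_RANK is used when present,
# otherwise SLURM_PROCID (set by Slurm for each task). If neither is set
# (for example a plain local run), the rank defaults to 0.
get_rank() {
    echo "${OMPI_COMM_WORLD_RANK:-${SLURM_PROCID:-0}}"
}

# Run the wrapped command. Only rank 0 goes through ncu; all other ranks
# run the command unchanged so that the communication can still complete.
run_rank() {
    rank="$(get_rank)"
    if [ "$rank" = "0" ]; then
        ncu -o "report_${rank}" --target-processes all \
            --metrics gpc__cycles_elapsed.sum "$@"
    else
        "$@"
    fi
}

# "$@" expands to all arguments given to this script, each kept as a
# separate word, i.e. the application command line that follows the
# wrapper name on the srun line.
if [ "$#" -gt 0 ]; then
    run_rank "$@"
fi
```

The wrapper would then take the place of ncu in the sbatch script, for example "srun ./ncu_rank0.sh python your_script.py", where your_script.py is a placeholder for the actual application. Because the rank variable is read inside the wrapper, it is evaluated separately in each task rather than once in the shell that runs the sbatch script.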

Thank you for your response. I’ll try collecting single-pass metrics in Nsight Compute. I want to collect cache-related metrics, so I hope they are single-pass metrics.

I also have a question about your answer.
I am wondering what “$@” means in your code. Do I need to write “python” in its place?
I used your code in a .sh script, but it couldn’t profile the application. Furthermore, the report file name doesn’t contain the value of ${OMPI_COMM_WORLD_RANK} (the file is named report_.ncu-rep).

++ I also tried “srun ncu --target-processes all --replay-mode application -o report python”. With this, profiling continues past the send kernel, and I get a report.ncu-rep file. Is this the right approach?

Thank you.