Using Nsight Compute (ncu) alongside srun

Hi, I would like to use Nsight Compute (ncu) with srun, but it does not seem to target the application.

I have used nsys alongside srun and it works fine. Is using ncu alongside srun not supported yet?

A sample command I have used is:
nsys profile --stats=true -o $outputfile srun ./myapp

Nsys works fine and was able to generate the report for all the kernels and API calls.

However when I replace nsys with ncu:
ncu --target-processes all --kernel-id ::regex:^.*${kernel}.*$:1 --set full -o ${outputfile}_ncu srun ./myapp

It manages to run the application but does not detect any kernels.
“==WARNING== No kernels were profiled.”

I tested the same command with a simple OpenACC vecadd C++ program, and ncu was also unable to detect the kernel when using srun. When I removed srun and ran the application directly with ./, it generated the Nsight Compute report fine.

This leads me to believe that ncu does not support being used with srun?

I am not aware of any specific issues between srun and ncu, but I have some general questions about the workflow you are trying to achieve. Given that nsys works for you, it appears you are allocating a resource on the local node with srun, and hence executing both the host and target processes on the same system? I haven’t been able to test this flow with srun locally so far, but launching mpirun under ncu worked without problems:

ncu --target-processes all mpirun ./CudaApp

Independent of your specific issue, you could very likely simply invert the order of ncu and srun commands to make it work, i.e. use

srun ncu --target-processes all --kernel-id ::regex:^.*${kernel}.*$:1 --set full -o ${outputfile}_ncu ./myapp

This command would even work if srun allocated on a remote node, as the ncu host process would then also be launched on that node. In your example, the host process could be on a different node than the target process, which would not work.
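As a rough sketch of the inverted order inside an sbatch script: the #SBATCH values, the kernel name my_kernel, the report prefix, and the run helper below are all placeholders for illustration, not something taken from a tested setup. The helper only prints the command unless DRY_RUN=0 is set, so the script can be checked on a machine without SLURM or Nsight Compute.

```shell
#!/bin/bash
#SBATCH --nodes=1              # placeholder allocation values
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1

kernel="my_kernel"             # placeholder kernel name
outputfile="report"            # placeholder report prefix

# Double quotes let ${kernel} expand; \$ keeps the regex end anchor literal.
kernel_id="::regex:^.*${kernel}.*\$:1"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# srun comes first, so ncu itself starts on the allocated compute node
# and launches ./myapp there.
run srun ncu --target-processes all --kernel-id "$kernel_id" \
    --set full -o "${outputfile}_ncu" ./myapp
```

Submitting this with sbatch and DRY_RUN=0 exported would run the real command; with the default it only echoes what would be launched.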

Hi Felix,

Thank you for your response!

Yes, so the system I am using is a supercomputer system. I am using sbatch scripts from the login node to allocate SLURM resources and use srun to run my application across those resources on compute nodes each equipped with A100 GPUs.

From examples online, I have seen that ncu works fine with mpirun. Unfortunately, the system only allows srun as the launch command.

After contacting their IT, they also suggested to invert the order. My command now looks like below:

srun -n $ranks ncu --kernel-id ::regex:'^.*update_top.*$':1 --set full -o update_top_ncu $command

I think the command is close to working, however now I receive this error:
“srun -n $ranks ncu --kernel-id ::regex:‘^.*update_top.*$’:1 --set full -o update_top_ncu $command’ resulted in 100 recursions!”

The application began to run and ncu stopped producing “no kernels were profiled”, but the application exits immediately and no further output is produced.

Is this an error message from ncu?

A bit more background information if it helps:
My application is an MPI application that uses OpenACC to offload to the GPU. I am trying to run it across multiple nodes, each with 4 GPUs. I am using SLURM to allocate multiple nodes with --ntasks-per-node=4 and --gres=gpu:4, and I submit the sbatch script from the login node.

I have also tried using srun in interactive mode to get a session directly on a compute node and run ncu ./myapp there. However, that results in a different error, which I assume is specific to the system:
[cli_0]: write_line error; fd=8 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=8 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Error:PMI: PMI_Get_appnum(&appnum) = -1

I think in this config, you still need to run your app under mpirun on the local system, e.g. mpirun -n X ncu ./myapp, so that the necessary MPI environment is set up. The error message suggests to me that this is not the case.

“resulted in 100 recursions”

No, this doesn’t come from ncu. You can tell because srun is part of the error message, and ncu wouldn’t know about srun. Did you perhaps use the wrong kind of quote characters? The error appears to show a backtick, which the shell interprets differently from single or double quotes. Maybe try using double quotes for the regex?
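To illustrate the quoting point, here is a small sketch of how a POSIX shell treats each kind of quote. The variable names are made up for the demonstration, and the last case assumes typographic quotes were pasted into the script, which may or may not be what happened here.

```shell
#!/bin/sh
kernel="update_top"   # kernel name from the command above

# Single quotes: nothing is expanded, so the pattern is passed verbatim.
single='::regex:^.*update_top.*$:1'

# Double quotes: ${kernel} is expanded; \$ keeps the end anchor literal.
double="::regex:^.*${kernel}.*\$:1"

# Backticks: the enclosed text is executed as a command and replaced by
# its output, so a regex written this way would never reach ncu.
backtick=`echo substituted`

# Typographic quotes are ordinary characters to the shell, so they stay
# in the value instead of quoting it.
curly=‘update_top’

printf '%s\n' "$single" "$double" "$backtick" "$curly"
```

The first two lines printed are identical, which is why either single or double quotes should be safe for the --kernel-id argument.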

Hi,

Did you solve this problem? I want to use Nsight Compute with SLURM, and I am using an sbatch script too, but I can’t get a result. I would appreciate it if you could let me know how you use it.

I used the following command line in the .sh file:

srun ncu --target-processes all -o report_$OMPI_COMM_WORLD_RANK python parallel.py

However, when it reaches the send kernel, Nsight Compute keeps running but never profiles the kernel.
I’m wondering how I can use Nsight Compute with SLURM in a multi-node environment.
Thank you.

Hi,
I have the same problem. Did you find a solution?
My Nsight Compute run just gets stuck when a CUDA library is invoked. It keeps running without producing any output, whereas Nsight Systems works fine.
Thank you.

Please provide more details such as the ncu version, ncu command line options used and if you see any errors or warnings when profiling. Are you also using srun?

Refer to the Multi-Process Support section in the Nsight Compute CLI documentation.
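One detail worth checking in the srun command quoted above is the report name. In a batch script the shell expands $OMPI_COMM_WORLD_RANK before srun starts any task, so if the variable is not set at that point, every rank ends up with the same output name. The sketch below assumes that is the case and uses the %q{ENV} file-name macro of the ncu -o option together with SLURM_PROCID, which srun sets per task. It only prints the resulting command, and it is not a confirmed fix for the hang described above.

```shell
#!/bin/sh
# Assumption: the batch script itself has no OMPI_COMM_WORLD_RANK set.
unset OMPI_COMM_WORLD_RANK

# Expanded by the shell before srun runs: same name for every rank.
expanded_by_shell="report_$OMPI_COMM_WORLD_RANK"

# Single quotes pass the %q{...} macro through untouched, so ncu can
# expand it inside each task.
expanded_by_ncu='report_%q{SLURM_PROCID}'

echo "shell-expanded name: $expanded_by_shell"
echo "srun ncu --target-processes all -o $expanded_by_ncu python parallel.py"
```

With this naming, each task would write its own report file instead of all tasks competing for one.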