Nsight Compute cannot profile a multi-GPU program that uses Open MPI and NVSHMEM

Hello, I want to profile my CUDA program with ncu.
I run the program with Open MPI and NVSHMEM. My environment is:

Docker container created from nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
Two GA100 GPUs without NVLink
NCCL 2.9.8
Open MPI 4.1.1
NVSHMEM 2.0.3

My program is a deep learning program that uses data parallelism. I use NVSHMEM, launched through MPI, to exchange data. The launch command looks like:

mpirun -np 2 application args

I tried to use ncu-ui 2023.2.2 on Windows to profile it remotely, but the remote launch does not inherit the environment of the remote machine.
So I use the ncu binary that ncu-ui 2023.2.2 placed on the target earlier.
I have tried two ways of running it:

/var/tmp/target/linux-desktop-glibc_2_11_3-x64/ncu --config-file off --export /var/tmp/report%i --force-overwrite --target-processes all --replay-mode application --app-replay-match grid --app-replay-buffer file --app-replay-mode relaxed --launch-count 1 --section-folder /var/tmp/sections mpirun --allow-run-as-root -np 2 application args

and

mpirun --allow-run-as-root -np 2 /var/tmp/target/linux-desktop-glibc_2_11_3-x64/ncu --config-file off --export /var/tmp/report%i --force-overwrite --target-processes all --replay-mode range --launch-count 1 --section-folder /var/tmp/sections application args

But neither of these methods works.
With the first method, where ncu launches mpirun, the following error is reported:

==PROF== Profiling "nvshmemi_init_array_kernel" - 0 (1/1): Application replay pass 5
==ERROR== Failed to profile "nvshmemi_init_array_kernel" in process 3128
==ERROR== Failed to profile "nvshmemi_init_array_kernel" in process 3129
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==ERROR== Unexpected number of profiled kernels. Application replay requires that the execution, combined with selected filters, guarantees a consistent set of kernels in all passes.
==ERROR== Check the --app-replay-match option for different matching strategies.
==WARNING== No kernels were profiled.

When mpirun launches a separate ncu instance for each rank, errors occur during replay:


It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed

--> Returned value No permission (-17) instead of ORTE_SUCCESS


*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[24df9172edc5:03014] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

I also tried the range replay mode, but got:

==PROF== Profiling "range" - 0 (1/1): ==PROF== Profiling "range" - 0 (1/1): 0%…50%…100%

==ERROR== LaunchFailed
==ERROR== Failed to profile "range" in process 3163
==PROF== Trying to shutdown target application
0%…50%…100%

==ERROR== LaunchFailed
==ERROR== Failed to profile "range" in process 3162
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No ranges were profiled.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No ranges were profiled.

My CUDA code is a distributed GCN that uses NVSHMEM to transfer data. By the way, nsys works with this program without any trouble.
How can I fix this problem?

For now I am using the old ncu 2021.1.1.0 installed via apt-get. It works directly when I use the -k option, but it does not seem to show the CPU call stack, and it offers fewer metrics than the newer version. The newest version of ncu still cannot be used.

Hi, @728882065

Thanks for using Nsight Compute!
So you mean you can profile successfully with the old version 2021.1.1.0, but not with 2023.2.2?

Which command was successful?
Which driver version do you use?
Is it possible to provide us with a minimal reproducer?

By the way, we recently released a new version, 2023.3.1. Can you check whether the issue still exists in that version? Thanks!
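
A pattern often used for MPI jobs may also be worth trying: mpirun launches a small wrapper script, and only one rank runs under ncu while the others start the application directly. The sketch below is not a confirmed fix for the errors in this thread. It assumes Open MPI exports OMPI_COMM_WORLD_RANK to each rank, and the script name profile_rank.sh and the variables NCU, PROFILE_RANK, REPORT_PREFIX and DRY_RUN are made up for illustration. It leaves ncu in its default kernel replay mode, which does not relaunch the process.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as profile_rank.sh (the name is illustrative).
# Assumption: Open MPI sets OMPI_COMM_WORLD_RANK for every rank it launches.
# NCU, PROFILE_RANK, REPORT_PREFIX and DRY_RUN are illustrative variables.
NCU="${NCU:-ncu}"
PROFILE_RANK="${PROFILE_RANK:-0}"
REPORT_PREFIX="${REPORT_PREFIX:-/var/tmp/report}"

# Run (or, with DRY_RUN set, only print) the command for one rank.
# Usage: run_rank <rank> <application> [args...]
run_rank() {
    rank="$1"
    shift
    if [ "$rank" = "$PROFILE_RANK" ]; then
        # Only the selected rank is wrapped in ncu. Kernel replay is the
        # ncu default, so no --replay-mode option is passed here.
        set -- "$NCU" --export "${REPORT_PREFIX}_rank${rank}" \
            --force-overwrite --launch-count 1 "$@"
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        exec "$@"
    fi
}

# Under mpirun, dispatch on this process's rank.
if [ -n "${OMPI_COMM_WORLD_RANK:-}" ] && [ "$#" -gt 0 ]; then
    run_rank "$OMPI_COMM_WORLD_RANK" "$@"
fi
```

It would be launched as: mpirun --allow-run-as-root -np 2 ./profile_rank.sh application args. A kernel name filter such as the -k option mentioned above can be added to the ncu arguments so that the single profiled launch is not spent on an NVSHMEM initialization kernel.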