Hello, I want to use ncu (Nsight Compute) to profile my CUDA program.
I run the program with OpenMPI and NVSHMEM. My environment is:
docker container created from nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
two GA100 GPUs without NVLink
NCCL 2.9.8
OpenMPI 4.1.1
NVSHMEM 2.0.3
My program is a deep learning program using data parallelism. I use NVSHMEM (running on top of MPI) to exchange data. The application is launched like this:
mpirun -np 2 application args
I tried to use ncu-ui 2023.2.2 on Windows to profile it remotely, but the remote launch cannot inherit the environment. So instead I use the ncu binary that ncu-ui 2023.2.2 deployed to the target earlier.
I have tried two methods to run it:
/var/tmp/target/linux-desktop-glibc_2_11_3-x64/ncu --config-file off --export /var/tmp/report%i --force-overwrite --target-processes all --replay-mode application --app-replay-match grid --app-replay-buffer file --app-replay-mode relaxed --launch-count 1 --section-folder /var/tmp/sections mpirun --allow-run-as-root -np 2 application args
and
mpirun --allow-run-as-root -np 2 /var/tmp/target/linux-desktop-glibc_2_11_3-x64/ncu --config-file off --export /var/tmp/report%i --force-overwrite --target-processes all --replay-mode range --launch-count 1 --section-folder /var/tmp/sections application args
Neither of these methods works.
With the first method, starting mpirun under ncu, the following error is reported:
==PROF== Profiling "nvshmemi_init_array_kernel" - 0 (1/1): Application replay pass 5
==ERROR== Failed to profile "nvshmemi_init_array_kernel" in process 3128
==ERROR== Failed to profile "nvshmemi_init_array_kernel" in process 3129
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==ERROR== Unexpected number of profiled kernels. Application replay requires that the execution, combined with selected filters, guarantees a consistent set of kernels in all passes.
==ERROR== Check the --app-replay-match option for different matching strategies.
==WARNING== No kernels were profiled.
With the second method, starting separate ncu processes under mpirun, errors occur during replay:
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[24df9172edc5:03014] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
I also tried range replay mode, but:
==PROF== Profiling "range" - 0 (1/1): ==PROF== Profiling "range" - 0 (1/1): 0%...50%...100%
==ERROR== LaunchFailed
==ERROR== Failed to profile "range" in process 3163
==PROF== Trying to shutdown target application
0%...50%...100%
==ERROR== LaunchFailed
==ERROR== Failed to profile "range" in process 3162
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No ranges were profiled.
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No ranges were profiled.
My CUDA code is a distributed GCN that uses NVSHMEM to transfer the data. By the way, nsys works fine on this program.
How can I fix this problem?
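One workaround I have seen suggested for MPI + profiler setups (not yet verified here) is to have mpirun launch a per-rank wrapper script, so that only rank 0 runs under ncu while the other ranks run the application directly. A sketch of the rank-selection logic; the ncu path and flags are copied from my commands above, and `build_cmd` is just an illustrative helper name:

```shell
#!/bin/bash
# Sketch: decide, per MPI rank, whether to wrap the application in ncu.
# OpenMPI exports OMPI_COMM_WORLD_RANK to every launched process.
NCU=/var/tmp/target/linux-desktop-glibc_2_11_3-x64/ncu

# build_cmd RANK APP ARGS... -> echo the command line that rank should run
build_cmd() {
    local rank="$1"; shift
    if [ "$rank" -eq 0 ]; then
        # profile only rank 0; a per-rank report name avoids file collisions
        echo "$NCU --export /var/tmp/report_rank${rank} --force-overwrite --target-processes all $*"
    else
        echo "$*"
    fi
}

# In the real wrapper this would be:  exec $(build_cmd "${OMPI_COMM_WORLD_RANK:-0}" "$@")
build_cmd "${OMPI_COMM_WORLD_RANK:-0}" "$@"
```

The real wrapper would `exec` the chosen command instead of only printing it, and mpirun would then be invoked as `mpirun -np 2 ./wrapper.sh application args`, so the MPI environment is set up before ncu ever starts.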
For now I use the old ncu 2021.1.1.0 installed via apt-get. It works directly when I use the -k option, but it seems it cannot show the CPU call stack, and the old version exposes fewer metrics than the newer one. The newest version of ncu still cannot be used.
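For completeness, the fallback command I run with the old ncu looks roughly like this (a sketch, not my exact command; `my_kernel` is a placeholder for my actual kernel name, and I am assuming the `%q{OMPI_COMM_WORLD_RANK}` filename placeholder is supported in this ncu version to give each rank its own report file):

```shell
# Fallback with ncu 2021.1.1: profile one named kernel, one report per rank.
# 'my_kernel' and 'application args' are placeholders for my program.
mpirun --allow-run-as-root -np 2 \
    ncu -k my_kernel \
        -o /var/tmp/report_%q{OMPI_COMM_WORLD_RANK} \
        --force-overwrite \
        application args
```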