NVPROF- Error: incompatible CUDA driver version.

vsolasa · January 4, 2019, 9:12pm

Hi Team,

I’m getting the error below. When I run the application as

mpirun ./main.exe
with LSF settings to run just a single MPI process, the output is
Rank=0 Size=1 - ciao!
Rank=0 Size=1 - Pointer(A)=0x146adf20
Rank=0 Size=1 - going to GPU
Rank=0 Size=1 - back from GPU
Rank=0 Size=1 - adieu!
Next I try to run under the profiler, that is
mpirun nvprof ./main.exe
and I get
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./main.exe: undefined symbol: PAMI_CUDA_RegisterPAMIContexts
[p10a11:134542] Error: common_pami.c:1019 - ompi_common_pami_init() Unable to create PAMI client (rc=1)

No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

Host: p10a11
Framework: pml

[p10a11:134542] PML pami cannot be selected
======== Error: Application returned non-zero code 1
So I figured maybe it was because I left the -gpu option out, that is, I should run with
mpirun -gpu nvprof ./main.exe
but this gives what looks like the same error
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./main.exe: undefined symbol: PAMI_CUDA_RegisterPAMIContexts
[p10a07:159447] Error: common_pami.c:1019 - ompi_common_pami_init() Unable to create PAMI client (rc=1)

No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

Host: p10a07
Framework: pml

[p10a07:159447] PML pami cannot be selected
So maybe I can try with MXM instead,
mpirun -mxm nvprof ./main.exe
That gives us
[1546632219.717627] [p10a05:108539:0] mxm.c:196 MXM WARN The ‘ulimit -s’ on the system is set to ‘unlimited’. This may have negative performance implications. Please set the stack size to the default value (10240)
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./main.exe: undefined symbol: PAMI_CUDA_RegisterPAMIContexts
Rank=0 Size=1 - ciao!
Rank=0 Size=1 - Pointer(A)=0x32895420
Rank=0 Size=1 - going to GPU
==108539== NVPROF is profiling process 108539, command: ./main.exe
======== Profiling result:
No kernels were profiled.
No API activities were profiled.
======== Error: incompatible CUDA driver version.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[33186,1],0]
Exit code: 19

Environment: OS redhat7.5-alternate, architecture ppc64le, NVIDIA driver 410.79 and CUDA 10 installed on all machines.

vsolasa · January 4, 2019, 11:04pm

One of our application engineers has looked at another aspect of this problem: the failure of Spectrum MPI’s PAMI libraries when profiling with nvprof. Below is his analysis.

The problem is that ‘nvprof’ is overriding the LD_PRELOAD that we set for the PAMI cudahooks library, which PAMI needs for correctness when handling CUDA buffers.

[b6p056zc@p10a11 nvprof-issue] mpirun -gpu env | grep LD_PRELOAD
OMPI_LD_PRELOAD_POSTPEND_DISTRO=/gpfs/gpfs_gl4_16mb/smpi/10.2.0.9/lib/libpami_cudahook.so
OMPI_MCA_mca_base_env_list_distro=MPI_ROOT,OPAL_PREFIX,OPAL_LIBDIR,PMIX_INSTALL_PREFIX,HWLOC_PLUGINS_PATH,PAMI_DISABLE_IPC,PAMI_IBV_DISABLE_RRMW,HCOLL_ALLREDUCE_ZCOPY_TUNE,OMPI_LD_PRELOAD_POSTPEND_DISTRO,LD_LIBRARY_PATH
LD_PRELOAD=/gpfs/gpfs_gl4_16mb/smpi/10.2.0.9/lib/libpami_cudahook.so
[b6p056zc@p10a11 nvprof-issue] mpirun -gpu nvprof env | grep LD_PRELOAD
OMPI_LD_PRELOAD_POSTPEND_DISTRO=/gpfs/gpfs_gl4_16mb/smpi/10.2.0.9/lib/libpami_cudahook.so
OMPI_MCA_mca_base_env_list_distro=MPI_ROOT,OPAL_PREFIX,OPAL_LIBDIR,PMIX_INSTALL_PREFIX,HWLOC_PLUGINS_PATH,PAMI_DISABLE_IPC,PAMI_IBV_DISABLE_RRMW,HCOLL_ALLREDUCE_ZCOPY_TUNE,OMPI_LD_PRELOAD_POSTPEND_DISTRO,LD_LIBRARY_PATH
LD_PRELOAD=libaccinj64.so.10.0
======== Warning: No profile data collected.

Since PAMI detects that libpami_cudahook.so is not loaded, it fails with the message you are seeing.
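A quick way to confirm this from inside a job is to test whether the cudahook library is still present in LD_PRELOAD. The following is only a minimal sketch: the helper name preload_has is hypothetical (it is not part of Spectrum MPI or nvprof), and it assumes the dynamic linker's usual convention that LD_PRELOAD entries are separated by spaces or colons.

```shell
#!/bin/bash
# Hypothetical helper (not part of Spectrum MPI or nvprof): succeed if the
# preload list in $1 contains an entry whose file name equals $2.
# Assumption: entries are separated by spaces or colons, as ld.so accepts.
preload_has() {
    local list=$1 wanted=$2 entry
    local IFS=' :'
    for entry in $list; do
        # Compare only the file name, so full paths and bare names both match.
        [ "${entry##*/}" = "$wanted" ] && return 0
    done
    return 1
}

# Report on the current environment, e.g. when run as "mpirun -gpu nvprof ./check.sh".
if preload_has "${LD_PRELOAD:-}" "libpami_cudahook.so"; then
    echo "PAMI cudahook is preloaded"
else
    echo "PAMI cudahook is missing from LD_PRELOAD"
fi
```

Run once under "mpirun -gpu" and once under "mpirun -gpu nvprof", this should show the same difference as the two env listings above.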

I’ve been tinkering and cannot seem to get it to work without wrapping the application. Here is my wrapper script

#=============================== cat wrap-me.sh
#!/bin/bash
# Prepend the PAMI cudahook library when nvprof has replaced LD_PRELOAD.
if [[ "x" != "x$LD_PRELOAD" && "x" != "x$OMPI_LD_PRELOAD_POSTPEND_DISTRO" ]] ; then
    if [ "$LD_PRELOAD" != "$OMPI_LD_PRELOAD_POSTPEND_DISTRO" ] ; then
        export LD_PRELOAD="$OMPI_LD_PRELOAD_POSTPEND_DISTRO $LD_PRELOAD"
    fi
fi
echo "=====> $LD_PRELOAD"
# Quote "$@" so arguments containing spaces are passed through intact.
exec "$@"
#===============================

Using that I can get a bit further:
[b6p056zc@p10a11 nvprof-issue] mpirun -gpu nvprof $PWD/wrap-me.sh ./main.exe
=====> /gpfs/gpfs_gl4_16mb/smpi/10.2.0.9/lib/libpami_cudahook.so libaccinj64.so.10.0
Rank=0 Size=1 - ciao!
Rank=0 Size=1 - Pointer(A)=0x34900610
Rank=0 Size=1 - going to GPU
==117570== NVPROF is profiling process 117570, command: ./main.exe
======== Profiling result:
No kernels were profiled.
No API activities were profiled.
======== Error: incompatible CUDA driver version.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55998,1],0]
Exit code: 19

I suspect the CUDA driver version error may have something to do with the binary. You can also try reordering the LD_PRELOAD in the script to put nvprof's library first:
export LD_PRELOAD="$LD_PRELOAD $OMPI_LD_PRELOAD_POSTPEND_DISTRO"
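To make it easier to try both orderings without editing the export line each time, the merge can be pulled into a small function. This is just a sketch under my own assumptions: join_preload is a hypothetical name, the lists are space-separated as in the wrapper script, and the sample values are the ones printed earlier in this thread.

```shell
#!/bin/bash
# Hypothetical helper: print the entries of $1 followed by those entries of
# $2 that are not already present, separated by single spaces.
join_preload() {
    local first=$1 second=$2 out entry
    out=$first
    for entry in $second; do
        case " $out " in
            *" $entry "*) ;;                  # already in the list, skip it
            *) out="${out:+$out }$entry" ;;   # append, adding a space if needed
        esac
    done
    printf '%s\n' "$out"
}

# Sample values taken from the output shown earlier in the thread.
hook=/gpfs/gpfs_gl4_16mb/smpi/10.2.0.9/lib/libpami_cudahook.so
prof=libaccinj64.so.10.0

join_preload "$hook" "$prof"   # PAMI hook first, as in wrap-me.sh
join_preload "$prof" "$hook"   # nvprof's library first
```

In the wrapper, the result would be assigned with something like export LD_PRELOAD="$(join_preload "$LD_PRELOAD" "$OMPI_LD_PRELOAD_POSTPEND_DISTRO")". Whether either ordering gets rid of the driver version error is untested here.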

I also checked this link and tried running as root with sudo privileges, but I still get the same error.

/usr/local/cuda-10.0/bin/nvprof
[1546963843.411570] [p10a11:162153:0] mxm.c:196 MXM WARN The ‘ulimit -s’ on the system is set to ‘unlimited’. This may have negative performance implications. Please set the stack size to the default value (10240)
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./main.exe: undefined symbol: PAMI_CUDA_RegisterPAMIContexts
Rank=0 Size=1 - ciao!
Rank=0 Size=1 - Pointer(A)=0x2b165420
Rank=0 Size=1 - going to GPU
==162153== NVPROF is profiling process 162153, command: ./main.exe
======== Error: incompatible CUDA driver version.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[26814,1],0]
Exit code: 19

[root@p10a73 OpenACCMPIProfilingTest]#
