Hi Team,
I’m getting the errors below when I try to profile an MPI application with nvprof. When I run the application without the profiler as
mpirun ./main.exe
with LSF settings to run just a single MPI process, the output is
Rank=0 Size=1 - ciao!
Rank=0 Size=1 - Pointer(A)=0x146adf20
Rank=0 Size=1 - going to GPU
Rank=0 Size=1 - back from GPU
Rank=0 Size=1 - adieu!
Next I tried to run it under the profiler, that is
mpirun nvprof ./main.exe
and I get
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./main.exe: undefined symbol: PAMI_CUDA_RegisterPAMIContexts
[p10a11:134542] Error: common_pami.c:1019 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: p10a11
Framework: pml
[p10a11:134542] PML pami cannot be selected
======== Error: Application returned non-zero code 1
So I figured maybe it was because I left the -gpu option out, that is, I should run with
mpirun -gpu nvprof ./main.exe
but this gives what looks like the same error
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./main.exe: undefined symbol: PAMI_CUDA_RegisterPAMIContexts
[p10a07:159447] Error: common_pami.c:1019 - ompi_common_pami_init() Unable to create PAMI client (rc=1)
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: p10a07
Framework: pml
[p10a07:159447] PML pami cannot be selected
So I tried MXM instead,
mpirun -mxm nvprof ./main.exe
That gives us
[1546632219.717627] [p10a05:108539:0] mxm.c:196 MXM WARN The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240)
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./main.exe: undefined symbol: PAMI_CUDA_RegisterPAMIContexts
Rank=0 Size=1 - ciao!
Rank=0 Size=1 - Pointer(A)=0x32895420
Rank=0 Size=1 - going to GPU
==108539== NVPROF is profiling process 108539, command: ./main.exe
======== Profiling result:
No kernels were profiled.
No API activities were profiled.
======== Error: incompatible CUDA driver version.
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[33186,1],0]
Exit code: 19
Environment: the OS is redhat7.5-alternate, the architecture is ppc64le, and NVIDIA driver 410.79 and CUDA 10 are installed on all machines.
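
For reference, here is a minimal sketch for checking which copies of the relevant tools a compute node actually resolves on PATH. It rests on an assumption that nothing above confirms: that the "incompatible CUDA driver version" message could come from an nvprof belonging to a different CUDA install than the one matching the driver. The helper name `report` is hypothetical. It only prints paths and changes nothing.

```shell
#!/bin/sh
# Print where each tool resolves on PATH, or "not found" if it is missing.
# "report" is a hypothetical helper name used only for this sketch.
report() {
    # $1 = name of the executable to look up
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: not found"
    fi
}

# Tools that appear in the runs above, plus nvcc and nvidia-smi,
# which would show the toolkit and driver side of the picture.
report mpirun
report nvprof
report nvcc
report nvidia-smi
```

Running this inside the same LSF job as the failing command (rather than on the login node) would show whether the job environment picks up the nvprof I expect.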