Hello,
We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen’d by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.
We have got this working in isolation, with some restrictions, but recently came across a new issue. On our machines, importing numpy before dynamically loading our application leads to a segfault:
$ cat shared.cpp
extern "C" void func() {}
$ nvc++ -acc -gpu=rdc -shared -o libshared.so shared.cpp
$ python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
Segmentation fault
$ python -c 'import ctypes; ctypes.CDLL("./libshared.so"); import numpy'
<no error>
The backtrace of the segfault shows:
$ gdb --args python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
…
(gdb) bt 8
#0 0x00007fffdfa8937a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90
#1 __kmpc_critical_with_hint (loc=0x0, global_tid=-1, crit=0x7fffd6789680 <smallmem_lock>, hint=0) at ../../src/kmp_csupport.cpp:1468
#2 0x00007fffd6579b42 in __pgi_uacc_smallmem (n=24) at ../../src/smallmem.c:43
#3 0x00007fffd5ff6ad5 in __pgi_uacc_cuda_load_pic_module (pic_pgi_cuda_loc=0x7fffd678e400 <__PGI_CUDA_LOC>, pic_pgi_cuda_cap=0x7fffd678e430 <__PGI_CUDA_CAP>) at ../../src/cuda_init.c:1836
#4 0x00007fffd678b0d7 in __pgi_uacc_set_shared () from ./libshared.so
#5 0x00007fffd678b01f in _init () from ./libshared.so
#6 0x00007fffffffb928 in ?? ()
#7 0x00007fffed8f297f in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
(More stack frames follow…)
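The issue appears to be that our numpy loads intel-mkl, which brings in libiomp5.so. Preloading that library into a standalone program reproduces the crash without Python: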
$ cat main.cpp
#include <dlfcn.h>
#include <stdexcept>
using func_t = void(*)();
int main() {
  void* h = dlopen("./libshared.so", RTLD_NOW);
  if(!h) { throw std::runtime_error(dlerror()); }
  auto* func = reinterpret_cast<func_t>(dlsym(h, "func"));
  if(!func) { throw std::runtime_error(dlerror()); }
  func();
  return 0;
}
$ g++ -ldl -o main main.cpp
$ ./main
<no error>
$ LD_PRELOAD=/path/to/intel-mkl-2020.4.304-rzr3hj/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libiomp5.so gdb ./main
...
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed55637a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90
In this standalone example, using -gpu=nordc avoids the problem. We are still working on some outstanding issues with using -gpu=nordc in our real application.
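In the Python reproducer above, loading the library before importing numpy did not crash. A possible stopgap based on that observation is to load the GPU library first and fail loudly if numpy is already present. This is only a minimal sketch: the helper names are hypothetical, and we have only seen the ordering help in the small reproducer, not in a real application.

```python
import ctypes
import sys


def conflicting_modules(blockers=("numpy",), loaded_modules=None):
    """Return the blocker modules that have already been imported."""
    if loaded_modules is None:
        loaded_modules = sys.modules
    return [name for name in blockers if name in loaded_modules]


def load_gpu_library(path, blockers=("numpy",)):
    """Load a shared library via ctypes, refusing if a blocker was imported first.

    Hypothetical helper: it only enforces the load order that worked in the
    reproducer (library first, numpy afterwards).
    """
    already = conflicting_modules(blockers)
    if already:
        raise RuntimeError(
            "refusing to load %s: already imported: %s" % (path, ", ".join(already))
        )
    return ctypes.CDLL(path)
```

Typical use would be to call load_gpu_library("./libshared.so") at the very top of the entry script, before any import that might pull in numpy.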
The issue took quite some effort to debug, and given the popularity of numpy it may well come up again. Hopefully this post will help other users.
We imagine that the issue might be related to some underlying LLVM code being used by both intel-mkl and nvc++ in incompatible ways, but that is only a suspicion. If this can be made more robust in future, that would obviously be great.
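For anyone trying to tell whether they are in the same situation, one way to see which OpenMP runtimes are already mapped into a Linux process is to scan /proc/self/maps. The sketch below does that from Python. The list of library names in the pattern is our assumption and may need extending, and the function names are hypothetical.

```python
import re

# Basenames of OpenMP runtimes to look for. This list is an assumption:
# libiomp5 (Intel), libomp (LLVM), libgomp (GNU), libnvomp (NVIDIA HPC SDK).
OPENMP_RUNTIME_PATTERN = re.compile(
    r"(?:^|/)lib(?:iomp5|omp|gomp|nvomp)\b[^/]*\.so[^/]*$"
)


def openmp_runtimes(maps_text):
    """Return sorted, de-duplicated OpenMP runtime paths found in maps text."""
    found = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        path = fields[5].strip()
        if OPENMP_RUNTIME_PATTERN.search(path):
            found.add(path)
    return sorted(found)


def loaded_openmp_runtimes():
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as maps_file:
        return openmp_runtimes(maps_file.read())
```

Calling loaded_openmp_runtimes() right after import numpy, and before loading the GPU-enabled library, shows whether a libiomp5.so is already in the process, as it was in our case.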