nvc++ OpenACC runtime segfaults if Intel MKL (numpy) is already loaded

Hello,

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen’d by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.

We have got this working in isolation, with some restrictions, but recently came across a new issue. On our machines, importing numpy before dynamically loading our application leads to a segfault:

$ cat shared.cpp
extern "C" void func() {}
$ nvc++ -acc -gpu=rdc -shared -o libshared.so shared.cpp
$ python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
Segmentation fault
$ python -c 'import ctypes; ctypes.CDLL("./libshared.so"); import numpy'
<no error>

The backtrace shows:

$ gdb --args python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
…
(gdb) bt 8
#0 0x00007fffdfa8937a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90
#1 __kmpc_critical_with_hint (loc=0x0, global_tid=-1, crit=0x7fffd6789680 <smallmem_lock>, hint=0) at ../../src/kmp_csupport.cpp:1468
#2 0x00007fffd6579b42 in __pgi_uacc_smallmem (n=24) at ../../src/smallmem.c:43
#3 0x00007fffd5ff6ad5 in __pgi_uacc_cuda_load_pic_module (pic_pgi_cuda_loc=0x7fffd678e400 <__PGI_CUDA_LOC>, pic_pgi_cuda_cap=0x7fffd678e430 <__PGI_CUDA_CAP>) at ../../src/cuda_init.c:1836
#4 0x00007fffd678b0d7 in __pgi_uacc_set_shared () from ./libshared.so
#5 0x00007fffd678b01f in _init () from ./libshared.so
#6 0x00007fffffffb928 in ?? ()
#7 0x00007fffed8f297f in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
(More stack frames follow…)

The issue appears to be that our numpy build loads Intel MKL, which brings in Intel's libiomp5 OpenMP runtime. We can reproduce the crash outside of Python entirely:

$ cat main.cpp
#include <dlfcn.h>
#include <stdexcept>
using func_t = void(*)();
int main() {
 void* h = dlopen("./libshared.so", RTLD_NOW);
 if(!h) { throw std::runtime_error(dlerror()); }
 auto* func = reinterpret_cast<func_t>(dlsym(h, "func"));
 if(!func) { throw std::runtime_error(dlerror()); }
 func();
 return 0;
}
$ g++ -o main main.cpp -ldl
$ ./main
<no error>
$ LD_PRELOAD=/path/to/intel-mkl-2020.4.304-rzr3hj/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libiomp5.so gdb ./main
...
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed55637a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90
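For anyone diagnosing something similar: a quick way to see which shared objects an import has actually mapped into the process is to read /proc/self/maps (Linux-specific; the helper name below is ours, not part of any library):

```python
def loaded_libs(substring):
    """Return the mapped shared-object paths containing `substring` (Linux only)."""
    with open("/proc/self/maps") as maps:
        # Each line is: address perms offset dev inode [pathname];
        # keep only entries that actually have a filesystem path.
        return sorted({line.split()[-1] for line in maps
                       if "/" in line.split()[-1]
                       and substring in line.split()[-1]})

# Sanity check: libc is always mapped into a CPython process.
print(loaded_libs("libc"))
# After `import numpy`, checking loaded_libs("mkl") and loaded_libs("iomp")
# shows whether an MKL-backed numpy has pulled in Intel's OpenMP runtime.
```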

In this standalone example, using -gpu=nordc avoids the problem; we are still working through some outstanding issues with -gpu=nordc in our real application.

The issue took quite some effort to debug, and given the popularity of numpy it may well come up again. Hopefully this post will help other users.

We imagine that the issue might be related to some underlying LLVM code being used by both intel-mkl and nvc++ in incompatible ways, but that is only a suspicion. If this can be made more robust in future then that would obviously be great.


Hi Olli,

I tested this on my system and it worked fine, but my numpy build doesn't use MKL.

From the error, my guess is that Intel's OpenMP runtime library is getting called rather than our NVOMP runtime; the two runtimes are incompatible.

Try adding the flag “-nomp” when creating the shared object so NVOMP isn’t linked in.
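For example (a sketch, assuming the shared.cpp from the original post and nvc++ from the HPC SDK on PATH; we cannot run this here without the compiler installed):

```shell
# Rebuild the shared object without linking in the NVOMP runtime
nvc++ -acc -gpu=rdc -nomp -shared -o libshared.so shared.cpp
# Verify that no OpenMP runtime remains as a link-time dependency
ldd libshared.so | grep -i omp || echo "no OpenMP runtime linked"
```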

Alternatively, you could install a numpy build that doesn't use MKL, or better yet use CuPy so the Python code can be offloaded to the GPU as well.

https://docs.cupy.dev/en/stable/overview.html

-Mat