Nvc++ OpenACC runtime segfaults if Intel MKL (numpy) is already loaded

Hello,

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen’d by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.

We have got this working, with some restrictions, in isolation, but recently came across a new issue. On our machines, importing numpy before dynamically loading our application leads to a segfault:

$ cat shared.cpp
extern "C" void func() {}
$ nvc++ -acc -gpu=rdc -shared -o libshared.so shared.cpp
$ python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
Segmentation fault
$ python -c 'import ctypes; ctypes.CDLL("./libshared.so"); import numpy'
<no error>

The backtrace shows:

$ gdb --args python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
…
(gdb) bt 8
#0 0x00007fffdfa8937a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90
#1 __kmpc_critical_with_hint (loc=0x0, global_tid=-1, crit=0x7fffd6789680 <smallmem_lock>, hint=0) at ../../src/kmp_csupport.cpp:1468
#2 0x00007fffd6579b42 in __pgi_uacc_smallmem (n=24) at ../../src/smallmem.c:43
#3 0x00007fffd5ff6ad5 in __pgi_uacc_cuda_load_pic_module (pic_pgi_cuda_loc=0x7fffd678e400 <__PGI_CUDA_LOC>, pic_pgi_cuda_cap=0x7fffd678e430 <__PGI_CUDA_CAP>) at ../../src/cuda_init.c:1836
#4 0x00007fffd678b0d7 in __pgi_uacc_set_shared () from ./libshared.so
#5 0x00007fffd678b01f in _init () from ./libshared.so
#6 0x00007fffffffb928 in ?? ()
#7 0x00007fffed8f297f in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
(More stack frames follow…)

The issue appears to be that our numpy loads Intel MKL:

$ cat main.cpp
#include <dlfcn.h>
#include <stdexcept>
using func_t = void(*)();
int main() {
 void* h = dlopen("./libshared.so", RTLD_NOW);
 if(!h) { throw std::runtime_error(dlerror()); }
 auto* func = reinterpret_cast<func_t>(dlsym(h, "func"));
 if(!func) { throw std::runtime_error(dlerror()); }
 func();
 return 0;
}
$ g++ -o main main.cpp -ldl
$ ./main
<no error>
$ LD_PRELOAD=/path/to/intel-mkl-2020.4.304-rzr3hj/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libiomp5.so gdb ./main
...
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed55637a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90

In this standalone example, using -gpu=nordc avoids the problem; we are still working through some outstanding issues with -gpu=nordc in our real application.

The issue took quite some effort to debug, and given the popularity of numpy it may well come up again. Hopefully this post will help other users.

We imagine that the issue might be related to some underlying LLVM code being used by both intel-mkl and nvc++ in incompatible ways, but that is only a suspicion. If this can be made more robust in future then that would obviously be great.


Hi Olli,

I tested this on my system and it worked fine, but my numpy doesn’t use MKL.

From the error, my guess is that Intel’s OpenMP runtime library is being called rather than our NVOMP runtime; the two are incompatible.

Try adding the flag “-nomp” when creating the shared object so NVOMP isn’t linked in.

Alternately, if you can, install numpy without MKL, or better yet use cupy so the Python code can be offloaded to the GPU as well:

https://docs.cupy.dev/en/stable/overview.html

-Mat

Hello,
When I use “-nomp”, it shows:

ld: /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/lib/libacchost.so: undefined reference to `__hxHostAvailableCores’

I set this in CMake:

set(CMAKE_CXX_COMPILER “/opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/bin/pgc++”)
set(CMAKE_CXX_FLAGS “-acc -Minfo=accel -noswitcherror -nomp”)

How can I fix it?

Hi 3158793232,

Our engineers streamlined our runtime libraries in the 22.11 release, which had the side effect of creating a dependency on our OpenMP library. This means adding “-nomp” can’t be used as a solution to the original issue.

Can you please describe what you’re doing? Are you encountering issues if you don’t use the “-nomp” flag?

-Mat

Hi Mat,
I intend to add OpenACC to a physics code that already uses MPI, OpenMP, and CUDA, so I created a kernels subdirectory and set CMAKE_CXX_COMPILER and CMAKE_CXX_FLAGS to use OpenACC. The kernels are linked as a shared library. Without “-nomp” the software compiles and links, but running it gives a segmentation fault; under gdb it shows the crash (screenshot attached).

How can I fix this error?
By the way, when I use OpenACC to accelerate the software, the loops are complicated and use classes and structs from other files that are not compiled with OpenACC. Can I use a class or struct without the “routine” directive?
Best wishes

One possibility is that you are building the main program, which uses OpenMP, with a different compiler. This causes the program to link against multiple OpenMP runtimes, which can cause incorrect behavior.

Are you able to build the entire project with nvc++? This should fix the issue since there would only be a single OpenMP runtime.

Another possibility: since the error is coming from an OMPT routine, you may need to compile with “-mp=ompt”. Granted, I don’t think this is the problem, but it’s worth a try.

By the way, when I use OpenACC to accelerate the software, the loops are complicated and use classes and structs from other files that are not compiled with OpenACC. Can I use a class or struct without the “routine” directive?

“routine” is necessary if you are calling a class method or a subroutine from within an OpenACC compute region. There’s no way around this, since “routine” tells the compiler to create the device code needed to execute the subroutine.

If you mean that you want to use the class or struct itself (i.e. the data), then you can include it in an OpenACC data region to create the device copy. Often folks add data management routines with “enter/exit data” directives in the SO which the main program can call. Basically, so long as the SO manages the device data, it can use variables from the main program.

Hi Mat,
I tried a lower version of the HPC SDK and also “-mp=ompt”, but neither works; it shows a conflict with the OpenMP .so in Intel oneAPI. Is there a way to fix it?


By the way, when I tried HPC SDK 20.9 (CUDA version 11.0), linking the shared library gives the error below.

How can I fix it? Thank you.
Best wishes

For the second issue, we don’t support the “--gnu++11” flag. You’ll either want to remove the flag from your build, or add the flag “-noswitcherror” to have the compiler ignore unknown flags.

it shows a conflict with the OpenMP .so in Intel oneAPI. Is there a way to fix it?

Again, ideally you’d compile all files with nvc++, or all with Intel without OpenMP support.

Failing that, one possibility is to try pre-loading our OpenMP runtime so the references are linked to it rather than Intel’s OpenMP runtime. I’m not sure it will work, but try something like:

LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/lib/libnvomp.so <exename>

-Mat


Hi Mat,
LD_PRELOAD works when I try it. Thanks!