Nvc++ OpenACC runtime segfaults if Intel MKL (numpy) is already loaded

Hello,

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen’d by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.

We have got this working, with some restrictions, in isolation, but recently came across a new issue. On our machines, importing numpy before dynamically loading our application leads to a segfault:

$ cat shared.cpp
extern "C" void func() {}
$ nvc++ -acc -gpu=rdc -shared -o libshared.so shared.cpp
$ python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
Segmentation fault
$ python -c 'import ctypes; ctypes.CDLL("./libshared.so"); import numpy'
<no error>

The backtrace shows:

$ gdb --args python -c 'import ctypes, numpy; ctypes.CDLL("./libshared.so")'
…
(gdb) bt 8
#0 0x00007fffdfa8937a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90
#1 __kmpc_critical_with_hint (loc=0x0, global_tid=-1, crit=0x7fffd6789680 <smallmem_lock>, hint=0) at ../../src/kmp_csupport.cpp:1468
#2 0x00007fffd6579b42 in __pgi_uacc_smallmem (n=24) at ../../src/smallmem.c:43
#3 0x00007fffd5ff6ad5 in __pgi_uacc_cuda_load_pic_module (pic_pgi_cuda_loc=0x7fffd678e400 <__PGI_CUDA_LOC>, pic_pgi_cuda_cap=0x7fffd678e430 <__PGI_CUDA_CAP>) at ../../src/cuda_init.c:1836
#4 0x00007fffd678b0d7 in __pgi_uacc_set_shared () from ./libshared.so
#5 0x00007fffd678b01f in _init () from ./libshared.so
#6 0x00007fffffffb928 in ?? ()
#7 0x00007fffed8f297f in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
(More stack frames follow…)

The issue appears to be that our numpy loads Intel MKL:

$ cat main.cpp
#include <dlfcn.h>
#include <stdexcept>
using func_t = void(*)();
int main() {
 void* h = dlopen("./libshared.so", RTLD_NOW);
 if(!h) { throw std::runtime_error(dlerror()); }
 auto* func = reinterpret_cast<func_t>(dlsym(h, "func"));
 if(!func) { throw std::runtime_error(dlerror()); }
 func();
 return 0;
}
$ g++ -o main main.cpp -ldl
$ ./main
<no error>
$ LD_PRELOAD=/path/to/intel-mkl-2020.4.304-rzr3hj/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin/libiomp5.so gdb ./main
...
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed55637a in __ompt_load_return_address (gtid=<optimized out>) at ../../src/ompt-specific.h:90

In this standalone example, using -gpu=nordc avoids the problem; we are still working through some outstanding issues with -gpu=nordc in our real application.

The issue took quite some effort to debug, and given the popularity of numpy it may well come up again. Hopefully this post will help other users.

We imagine that the issue might be related to some underlying LLVM code being used by both intel-mkl and nvc++ in incompatible ways, but that is only a suspicion. If this can be made more robust in future then that would obviously be great.


Hi Olli,

I tested this on my system and it worked fine, but my numpy doesn’t use MKL.

From the error, my guess is that Intel’s OpenMP runtime library is being called rather than our NVOMP runtime; the two are incompatible.

Try adding the flag “-nomp” when creating the shared object so NVOMP isn’t linked in.

Alternately, if you can, install numpy without MKL, or better yet use cupy so the Python code can be offloaded to the GPU as well:

https://docs.cupy.dev/en/stable/overview.html

-Mat

Hello,
When I use “-nomp”, it shows:

ld: /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/lib/libacchost.so: undefined reference to `__hxHostAvailableCores’

I set this in CMake:

set(CMAKE_CXX_COMPILER “/opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/bin/pgc++”)
set(CMAKE_CXX_FLAGS “-acc -Minfo=accel -noswitcherror -nomp”)

How can I fix it?

Hi 3158793232,

Our engineers streamlined our runtime libraries in the 22.11 release, which had the side effect of creating a dependency on our OpenMP library. This means adding “-nomp” can’t be used as a solution to the original issue.

Can you please describe what you’re doing? Are you encountering issues if you don’t use the “-nomp” flag?

-Mat

Hi Mat,
I intend to add OpenACC to a physics code that already uses MPI, OpenMP, and CUDA, so I created a kernels subdirectory and set CMAKE_CXX_COMPILER and CMAKE_CXX_FLAGS to use OpenACC. The kernels are linked as a shared library. Without “-nomp” the software compiles and links, but running it gives a segmentation fault; under gdb it shows the crash (screenshot attached).

How can I fix this error?
By the way, when I use OpenACC to accelerate the software, the loops are complicated and use classes and structs from other files that are not compiled with OpenACC. Can I use a class or struct without the “routine” directive?
Best wishes

One possibility is that you are building the main program, which uses OpenMP, with a different compiler. This causes the program to link against multiple OpenMP runtimes, which can cause incorrect behavior.

Are you able to build the entire project with nvc++? This should fix the issue since there would only be a single OpenMP runtime.

Another possibility: since the error is coming from an OMPT routine, you may need to compile with “-mp=ompt”. Granted, I don’t think this is the problem, but it’s worth a try.

By the way, when I use OpenACC to accelerate the software, the loops are complicated and use classes and structs from other files that are not compiled with OpenACC. Can I use a class or struct without the “routine” directive?

“routine” is necessary if you are calling a class method or a subroutine from within an OpenACC compute region. There’s no way around this, since “routine” tells the compiler to create the device code needed to execute the subroutine.

If you mean that you want to use the class or struct itself (i.e. the data), then you can include it in an OpenACC data region to create the device copy. Often folks add data management routines with “enter/exit data” directives in the SO which the main program can call. Basically, so long as the SO manages the device data, it can use variables from the main program.

Hi Mat,
I tried a lower version of the HPC SDK and also “-mp=ompt”, but neither works; it shows a conflict with the OpenMP .so in Intel oneAPI. Is there a way to fix it?


By the way, when I tried HPC SDK 20.9 (CUDA version 11.0), linking the shared library gives the error below.

How can I fix it? Thank you.
Best wishes

For the second issue, we don’t support the “--gnu++11” flag. You’ll either want to remove the flag from your build, or add the flag “-noswitcherror” to have the compiler ignore unknown flags.

it shows a conflict with the OpenMP .so in Intel oneAPI. Is there a way to fix it?

Again, ideally you’d compile all files with nvc++, or all with Intel without OpenMP support.

Failing that, one possibility is to try pre-loading our OpenMP runtime so the references are linked to it rather than Intel’s OpenMP runtime. I’m not sure it will work, but try something like:

LD_PRELOAD=/opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/lib/libnvomp.so <exename>

-Mat


Hi Mat,
LD_PRELOAD works when I try it. Thanks!