Hello,
We have been working on making the GPU version of our application usable as a shared library (e.g. to be loaded from Python).
Unfortunately, there seem to be some edge cases that do not work well. When some of the more complex components of our application are enabled (at compile time), we see errors from CUDA/OpenACC in code that works when we link statically.
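For context, the library is ultimately loaded at runtime (Python's ctypes ends up doing a dlopen), so the usage pattern is roughly the sketch below; the library name and the exported entry point are just placeholders for illustration:
#include <dlfcn.h>
#include <cstdio>

int main() {
    // Load the OpenACC-enabled shared library at runtime,
    // much as Python's ctypes.CDLL("libexample.so") would.
    void* handle = dlopen("./libexample.so", RTLD_NOW);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    // Look up and call whatever entry point the library exports (placeholder name).
    if (auto* run = reinterpret_cast<void (*)()>(dlsym(handle, "run_gpu_work"))) {
        run();
    }
    dlclose(handle);
    return 0;
}
The reproducer below does not even need dlopen: linking the executable against the shared library at build time is already enough to trigger the error.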
A distilled version of one of these failures is:
struct A {
  int a{};
};
A f() { return {}; }
void launcher() {
#pragma acc kernels
  {
    f();
  }
}
If this code is compiled and linked into a shared library:
$ nvc++ -Minfo=accel -acc -cuda -gpu=cuda11.7,lineinfo,cc70,rdc -fPIC -shared -o libtest.so pp.cpp
f():
4, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
launcher():
7, Accelerator serial kernel generated
Generating NVIDIA GPU code
and a trivial application (which does not even call launcher()) is linked against it:
#include <openacc.h>
int main() {
  acc_set_device_num(0, acc_device_nvidia);
}
with
nvc++ -Minfo=accel -acc -cuda -gpu=cuda11.7,lineinfo,cc70,rdc -fPIC -o main main.cpp -L. -ltest
then the resulting binary fails with
Failing in Thread:1
call to cudaGetSymbolAddress returned error 13: Other
and if I set a breakpoint on cudaGetSymbolAddress I see:
(gdb) bt
#0 0x00007fffecedd290 in cudaGetSymbolAddress () from /path/to/nvhpc-22.5-5ilgib/Linux_x86_64/22.5/cuda/11.7/lib64/libcudart.so.11.0
#1 0x00007fffed156b14 in __pgi_uacc_cuda_static (hostptr=0x7fffedaff7c8 <_T2_6948.7592>, size=4, name=0x7fffedafb098 <.S07772> "_T2_6948", flags=1280, dindex=1, devid=1) at ../../src/cuda_static.c:194
#2 0x00007fffed156d1d in walk_cuda_static (r=0x420d30, userdata=0x7fffffff70f0) at ../../src/cuda_static.c:236
#3 0x00007fffed6d1b9e in _rb_walk () from /path/to/nvhpc-22.5-5ilgib/Linux_x86_64/22.5/compilers/lib/libacchost.so
#4 0x00007fffed6d1c31 in __pgi_uacc_rb_walk () from /path/to/nvhpc-22.5-5ilgib/Linux_x86_64/22.5/compilers/lib/libacchost.so
#5 0x00007fffed156e58 in __pgi_uacc_cuda_static_create (dindex=1) at ../../src/cuda_static.c:284
#6 0x00007fffed155a2b in __pgi_uacc_cuda_load_this_module (dindex=1, error=0, pgi_cuda_loc=0x40e0c0 <__PGI_CUDA_LOC>, skip_static_init=0) at ../../src/cuda_init.c:1770
#7 0x00007fffed155e1a in __pgi_uacc_cuda_load_module (dindex=1, error=0) at ../../src/cuda_init.c:1900
#8 0x00007fffed6b87fa in __pgi_uacc_init_device (dindex=1) at ../../src/init.c:750
#9 0x00007fffed156808 in __pgi_uacc_cuda_set_device (dindex=1) at ../../src/cuda_init.c:2195
#10 0x00007fffed6bdf28 in __pgi_uacc_set_device_num (devnum=0, devtype=acc_device_nvidia) at ../../src/set_device.c:68
#11 0x00007fffed68cf0e in acc_set_device_num (devnum=0, devtype=acc_device_nvidia) at ../../src/acc_set_device_num.c:46
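(For reference, the backtrace above comes from a plain gdb session; the breakpoint has to be left pending until libcudart is loaded at program start.)
$ gdb ./main
(gdb) break cudaGetSymbolAddress
(gdb) run
(gdb) bt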
I can reproduce this with -gpu=rdc in both 22.3 (using -gpu=cuda11.6) and 22.5. The reproducer does not give an error if I use -gpu=nordc, but in the full application -gpu=nordc appears to cause other issues. We have seen in other forum posts (e.g. "Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work") that -gpu=nordc is recommended, and that -gpu=rdc may have problems with C++ code.
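For clarity, the -gpu=nordc configuration mentioned above is obtained by replacing rdc with nordc in the -gpu options shown earlier, e.g. for the library:
$ nvc++ -Minfo=accel -acc -cuda -gpu=cuda11.7,lineinfo,cc70,rdc -fPIC -shared -o libtest.so pp.cpp    # fails at runtime as above
$ nvc++ -Minfo=accel -acc -cuda -gpu=cuda11.7,lineinfo,cc70,nordc -fPIC -shared -o libtest.so pp.cpp  # no error for this reproducer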
Firstly, if you agree that this example exposes a compiler bug, I hope the reproducer helps to get it fixed.
Secondly, I have two questions:
- Do you recommend in this situation that we focus on fixing the issues exposed by -gpu=nordc, or persist with -gpu=rdc? At present the issues with -gpu=nordc seem “worse”, as judged by high-level metrics such as the number of test failures, but it may be that the underlying causes are less severe.
- Do you anticipate significant differences between 22.3 and 22.5 in this area? At present we are still using 22.3 as a baseline, pending proper investigation of a probably-unrelated issue involving a different part of our code and 22.5. However, if shared library support in 22.5 is expected to be significantly better, we could reprioritise.
Thanks in advance!
Olli