cudaGetSymbolAddress error when mixing OpenACC and shared libraries

Hello,

We have been working on making the GPU version of our application usable as a shared library (e.g. to be loaded from Python).

Unfortunately there seem to be some edge cases that are not working well. When some more complex components of our application are enabled (at compile time), we see errors from CUDA/OpenACC in code that works when we link statically.

A distilled version of one of these failures is:

struct A {
  int a{};
};
A f() { return {}; }
void launcher() {
  #pragma acc kernels
  {
    f();
  }
}

if this code is compiled + linked into a shared library:

$ nvc++ -Minfo=accel -acc -cuda -gpu=cuda11.7,lineinfo,cc70,rdc -fPIC -shared -o libtest.so pp.cpp
f():
      4, Generating implicit acc routine seq
         Generating acc routine seq
         Generating NVIDIA GPU code
launcher():
      7, Accelerator serial kernel generated
         Generating NVIDIA GPU code

and linked against a trivial application (that does not even call launcher()):

#include <openacc.h>
int main() {
  acc_set_device_num(0, acc_device_nvidia);
}

with

nvc++ -Minfo=accel -acc -cuda -gpu=cuda11.7,lineinfo,cc70,rdc -fPIC -o main main.cpp -L. -ltest

then the resulting binary fails with

Failing in Thread:1
call to cudaGetSymbolAddress returned error 13: Other

and if I set a break point on cudaGetSymbolAddress I see

(gdb) bt
#0  0x00007fffecedd290 in cudaGetSymbolAddress () from /path/to/nvhpc-22.5-5ilgib/Linux_x86_64/22.5/cuda/11.7/lib64/libcudart.so.11.0
#1  0x00007fffed156b14 in __pgi_uacc_cuda_static (hostptr=0x7fffedaff7c8 <_T2_6948.7592>, size=4, name=0x7fffedafb098 <.S07772> "_T2_6948", flags=1280, dindex=1, devid=1) at ../../src/cuda_static.c:194
#2  0x00007fffed156d1d in walk_cuda_static (r=0x420d30, userdata=0x7fffffff70f0) at ../../src/cuda_static.c:236
#3  0x00007fffed6d1b9e in _rb_walk () from /path/to/nvhpc-22.5-5ilgib/Linux_x86_64/22.5/compilers/lib/libacchost.so
#4  0x00007fffed6d1c31 in __pgi_uacc_rb_walk () from /path/to/nvhpc-22.5-5ilgib/Linux_x86_64/22.5/compilers/lib/libacchost.so
#5  0x00007fffed156e58 in __pgi_uacc_cuda_static_create (dindex=1) at ../../src/cuda_static.c:284
#6  0x00007fffed155a2b in __pgi_uacc_cuda_load_this_module (dindex=1, error=0, pgi_cuda_loc=0x40e0c0 <__PGI_CUDA_LOC>, skip_static_init=0) at ../../src/cuda_init.c:1770
#7  0x00007fffed155e1a in __pgi_uacc_cuda_load_module (dindex=1, error=0) at ../../src/cuda_init.c:1900
#8  0x00007fffed6b87fa in __pgi_uacc_init_device (dindex=1) at ../../src/init.c:750
#9  0x00007fffed156808 in __pgi_uacc_cuda_set_device (dindex=1) at ../../src/cuda_init.c:2195
#10 0x00007fffed6bdf28 in __pgi_uacc_set_device_num (devnum=0, devtype=acc_device_nvidia) at ../../src/set_device.c:68
#11 0x00007fffed68cf0e in acc_set_device_num (devnum=0, devtype=acc_device_nvidia) at ../../src/acc_set_device_num.c:46

I can reproduce this with both 22.3 (-gpu=cuda11.6) and 22.5 with -gpu=rdc. This reproducer does not give an error if I use -gpu=nordc, but in the full application then using -gpu=nordc appears to cause other issues. We have seen in other forum posts (e.g. Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work) that -gpu=nordc is recommended, and that -gpu=rdc may have problems with C++ code.

Firstly, I hope that if you think this example exposes a compiler bug that the reproducer helps to get it fixed.
Secondly, I have two questions:

  1. Do you recommend in this situation that we focus on fixing the issues exposed by -gpu=nordc, or persist with -gpu=rdc? At present, the issues with -gpu=nordc seem “worse”, as judged by high level metrics like the number of test failures, but it may be that the underlying causes are less severe.
  2. Do you anticipate significant differences between 22.3 and 22.5 in this area? At present we are still using 22.3 as a baseline, pending proper investigation of a probably-unrelated issue with a different part of our code and 22.5. However, if shared library support in 22.5 is expected to be significantly better then we could reprioritise.

Thanks in advance!
Olli

1 Like

Hi Olli,

For some reason I couldn’t reproduce the error with the original code, but after adding an explicit declaration of a global “A” struct, then I was able to. Looks like a problem with creating global device structs. Note in your case, the “T2_…” symbols is an anonymous object, most likely the return type from “f”.

I added a problem report, TPR #32120, and sent if to engineering for investigation.

For question #1, focus on the RDC issues, Once we were able to start performing the device link within a shared object, nordc shouldn’t be necessary and I rather move forward with these. nordc can be a work around but you wont be able to do anything that requires linking, such as calling functions or accessing global variables in separate objects.

For #2, I’d recommend moving forward if possible, and 22.7 should be out soon. We’ve had the most issues with C++ shared objects, both in OpenACC and OpenMP, and have several other reports that we’re working through. If I remember correctly we had at least one issue resolved in 22.5.

Then again, changing your baseline every few months may not be feasible, so it’s up to you what’s best for your project.

-Mat