Cannot dynamically load a shared library containing both OpenACC and CUDA code

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen’d by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.

We understand from the forums that this is not expected to work with OpenMP target offload yet (Creating a shared library that utilises OpenMP offloading) so, while we would like to explore this with OpenMP, we are currently focusing on OpenACC.

I recently posted on a similar topic: Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work. In that post we have a simple example working where the offloaded kernel is pure OpenACC. Unfortunately, in our real application we also need to call functions defined in CUDA code from OpenACC regions. This is because some external library routines that we use (Eigen, Random123) cannot reliably be used “inline” in OpenACC/OpenMP regions, but they can be compiled by nvcc for GPU execution.

All of this currently works with nvc++ if we link the various CUDA/OpenACC object files statically into an executable and launch that; the problems arise when we try to build a shared library and either dynamically link it to an executable or open it with dlopen.

A small standalone example consists of this CUDA code:

$ cat cuda.cu
#include <cstdio>
__host__ __device__ void foo() {
#ifdef __CUDA_ARCH__
  std::printf("hello from device\n");
#else
  std::printf("hello from host\n");
#endif
}

and this OpenACC-annotated C++ code:

$ cat directives.cpp
#pragma acc routine seq
void foo();

extern "C" int launch() {
  #pragma acc kernels
  {
    foo();
  }
  return 0;
}

which we build and link into a shared library as follows:

nvc++ -acc -cuda -gpu=cc70,debug -g -fPIC -o directives.o -c directives.cpp
nvcc --generate-code=arch=compute_70,code=[compute_70,sm_70] -g -G -Xcompiler=-fPIC -o cuda.o -dc cuda.cu
nvc++ -acc -cuda -gpu=cc70,debug -g -shared -o libshared_acc_rdc.so directives.o cuda.o

where we are relying on nvc++ to do the device code linking, rather than including an explicit nvcc -dlink step, following the discussion in: Separate compilation of mixed CUDA OpenACC code.

We dynamically load this library with the following executable:

$ cat dynamic.cpp
#include <dlfcn.h>
#include <stdexcept>
using launch_t = int(*) ();
int main() {
  void* h = dlopen("./libshared_acc_rdc.so", RTLD_NOW);
  if(!h) { throw std::runtime_error{dlerror()}; }
  auto* launch = reinterpret_cast<launch_t>(dlsym(h, "launch"));
  if(!launch) { throw std::runtime_error{dlerror()}; }
  return launch();
}
g++ -ldl -o main_gcc dynamic.cpp

All of the above builds without any warnings or errors, but running main_gcc segfaults with:

(gdb) bt
#0 0x00007fffebf81981 in __pgi_cuda_register_fat_binaryA (fatCubin0=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>, pgi_cuda_loc=0x0) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:157
#1 0x00007fffebf81b33 in __pgi_cuda_register_fat_binary (fatCubin=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:217
#2 0x00007fffedaf5205 in __cudaRegisterLinkedBinary_b140de3b_7_cuda_cu__Z3foov () from ./libshared_acc_rdc.so
#3 0x00007fffedaf5603 in __sti____cudaRegisterAll () at /path/to/tmp/tmpxft_0001a6b4_00000000-6_cuda.cudafe1.stub.c:14
#4 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8 0x00007fffed6dffab in dlopen_doit () from /lib64/libdl.so.2
#9 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffed6e05ad in _dlerror_run () from /lib64/libdl.so.2
#11 0x00007fffed6e0041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#12 0x00000000004011e0 in main ()

which seems to be related to the device-code linking of the CUDA code performed by nvc++.
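
As a side note, the fat-binary registration constructors implicated in the backtrace (__sti____cudaRegisterAll, __cudaRegisterLinkedBinary_*) can be inspected in the library's symbol table with, for example:

$ nm -C libshared_acc_rdc.so | grep -i cudaregister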

A brief summary of other, unsuccessful experiments:

  • Not actually calling foo() from the OpenACC region makes no difference; it still crashes as above.
  • Adding an explicit device-linking step (nvcc --generate-code=arch=compute_70,code=[compute_70,sm_70] -g -G -Xcompiler=-fPIC -dlink -o dlink.o directives.o cuda.o) and adding the result to the shared library (nvc++ -acc -cuda -gpu=cc70,debug -g -shared -o libshared_acc_rdc.so directives.o cuda.o dlink.o) causes dlopen to fail with undefined symbol: __fatbinwrap_14directives_cpp, which seems consistent with: Separate compilation of mixed CUDA OpenACC code.
  • Disabling relocatable device code (-gpu=nordc for nvc++ and replacing -dc with -dw for nvcc) causes directives.cpp to fail to compile with ptxas fatal : Unresolved extern function '_Z3foov' (even if I append cuda.o to the command line). This experiment was inspired by: problem of openacc compiled shared lib on linux.

Note that the main function that calls dlopen is compiled with g++, not nvc++, because of this issue: Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work.

Is there anything else we can try to create a dynamically loadable library containing both OpenACC and CUDA code?


Hi Olli,

Same answer as your first question: compile the C++ code with “-gpu=nordc”. However, this one is a bit trickier in that, in order to call the CUDA device routine, you need RDC enabled so the device linker can resolve the symbol.

What I’d try is merging the cuda.cu file in with the directives.cpp file so the routine can be inlined instead of called, thus removing the need for the link step. While not fully supported, nvc++ can compile CUDA code. However, since nvc++ is a single-pass compiler, it cannot support “__CUDA_ARCH__”. Instead, we’ve added a constexpr “if target(nv::target::is_device)” which can be evaluated at compile time to mimic the behavior.

Normally I’d write an example for you, but my home internet is currently down and I can’t get VPN to work through my tethered cell phone, so I can’t get to my systems. Once back, I’ll post a follow-up with an example. In the meantime, you can see an example I posted HERE.

-Mat

I was able to get VPN to work through my phone.

Here’s an example:

% cat dynamic.cpp
#include <dlfcn.h>
#include <stdexcept>
using launch_t = int(*) ();
int main() {
  void* h = dlopen("./libshared_acc_rdc.so", RTLD_NOW);
  if(!h) { throw std::runtime_error{dlerror()}; }
  auto* launch = reinterpret_cast<launch_t>(dlsym(h, "launch"));
  if(!launch) { throw std::runtime_error{dlerror()}; }
  return launch();
}

% cat directives.cpp
#include <cstdio>
#include <nv/target>

void foo() {
  if target(nv::target::is_device) {
    std::printf("hello from device\n");
  } else {
    std::printf("hello from host\n");
  }
}

void foo();

extern "C" int launch() {
  #pragma acc kernels
  {
    foo();
  }
  return 0;
}

% nvc++ -acc -cuda -gpu=nordc -shared -o libshared_acc_rdc.so directives.cpp -V22.3
% g++ -o main_gcc dynamic.cpp -ldl
% ./main_gcc
hello from device

Hi Mat,

Thanks for the example. I can build and run it locally and get the same results as you, which is good, but I noticed that if I run it under compute-sanitizer or cuda-memcheck then I see various errors/messages like:

$ compute-sanitizer ./main_gcc
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuGetProcAddress.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x217350]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x27460]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame: [0x2e8d7]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame: [0x316c8]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame:__pthread_once_slow [0x620b]
=========                in /lib64/libpthread.so.0
=========     Host Frame: [0x75ce9]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame: [0x23737]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame:cudaGetDevice [0x485c1]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame:../../src/cuda_init.c:1224:__pgi_uacc_cuda_initdev [0x273df]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacccuda.so
=========     Host Frame:../../src/init.c:554:__pgi_uacc_enumerate [0x46601]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacchost.so
=========     Host Frame:../../src/init.c:635:__pgi_uacc_initialize [0x469fc]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacchost.so
=========     Host Frame:../../src/enter.c:43:__pgi_uacc_enter [0x43a51]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacchost.so
=========     Host Frame:/gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/mat/directives.cpp:13:launch [0xb332]
=========                in ./libshared_acc_rdc.so
=========     Host Frame:main [0x126b]
=========                in /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/mat/./main_gcc
=========     Host Frame:__libc_start_main [0x22555]
=========                in /lib64/libc.so.6
=========     Host Frame: [0x1119]
=========                in /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/mat/./main_gcc

Is this something that is expected and can safely be ignored?

We can try a few more combinations of CUDA syntax and nvc++ to see if it’s possible to “inline” our usage of Eigen in the way that you suggest; in the past we have only managed to persuade Eigen to work in GPU code with nvcc (e.g. Error using atomics in OpenMP offload region came out of our most recent attempt with nvc++).

Yes, you can safely ignore this. The OpenACC runtime is testing whether the pointers are CUDA device pointers, which they are not, hence the error from the CUDA runtime.
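
If you want to see roughly what that kind of probe looks like, here is a minimal, self-contained sketch using the CUDA runtime API. It is purely illustrative, not the exact driver-API call the OpenACC runtime makes, and the behavior for plain host pointers differs between CUDA versions:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int x = 0;  // an ordinary host variable, not device memory
  cudaPointerAttributes attr;
  cudaError_t err = cudaPointerGetAttributes(&attr, &x);
  if (err != cudaSuccess) {
    // Older CUDA versions return an error for unregistered host pointers;
    // a runtime performing this check would clear the error and carry on.
    std::printf("not a device pointer: %s\n", cudaGetErrorString(err));
    cudaGetLastError();  // reset the error state
  } else {
    // CUDA 11+ reports cudaMemoryTypeUnregistered for plain host pointers
    std::printf("memory type: %d\n", static_cast<int>(attr.type));
  }
  return 0;
}

Either way, the outcome just tells the caller that the pointer is host memory, which is what the runtime expects here, so the message you see under compute-sanitizer is benign.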