Cannot dynamically load a shared library containing both OpenACC and CUDA code

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen’d by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.

We understand from the forums that this is not expected to work with OpenMP target offload yet (Creating a shared library that utilises OpenMP offloading) so, while we would like to explore this with OpenMP, we are currently focusing on OpenACC.

I recently posted: Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work on a similar topic. In that post we have a simple example working where the offloaded kernel is pure OpenACC. Unfortunately in our real application, we also need to call functions that are defined in CUDA code from OpenACC regions. This is because some external library routines that we use (Eigen, Random123) cannot be reliably used “inline” in OpenACC/OpenMP regions, but are compilable by nvcc for GPU execution.

All of this currently works with nvc++ if we link the various CUDA/OpenACC object files statically into an executable and launch that; the problems arise when we try to build a shared library and either dynamically link it to an executable or open it with dlopen.

A small standalone example consists of this CUDA code:

$ cat cuda.cu
#include <cstdio>
__host__ __device__ void foo() {
#ifdef __CUDA_ARCH__
  std::printf("hello from device\n");
#else
  std::printf("hello from host\n");
#endif
}

and this OpenACC-annotated C++ code:

$ cat directives.cpp
#pragma acc routine seq
void foo();

extern "C" int launch() {
  #pragma acc kernels
  {
    foo();
  }
  return 0;
}

which we build and link into a shared library as follows:

nvc++ -acc -cuda -gpu=cc70,debug -g -fPIC -o directives.o -c directives.cpp
nvcc --generate-code=arch=compute_70,code=[compute_70,sm_70] -g -G -Xcompiler=-fPIC -o cuda.o -dc cuda.cu
nvc++ -acc -cuda -gpu=cc70,debug -g -shared -o libshared_acc_rdc.so directives.o cuda.o

where we are relying on nvc++ to do the device code linking, rather than including an explicit nvcc -dlink step, following the discussion in: Separate compilation of mixed CUDA OpenACC code.

We dynamically load this library with the following executable:

$ cat dynamic.cpp
#include <dlfcn.h>
#include <stdexcept>
using launch_t = int(*) ();
int main() {
  void* h = dlopen("./libshared_acc_rdc.so", RTLD_NOW);
  if(!h) { throw std::runtime_error{dlerror()}; }
  auto* launch = reinterpret_cast<launch_t>(dlsym(h, "launch"));
  if(!launch) { throw std::runtime_error{dlerror()}; }
  return launch();
}
g++ -ldl -o main_gcc dynamic.cpp

All of the above works without any warnings/errors, but actually executing main_gcc segfaults with

(gdb) bt
#0 0x00007fffebf81981 in __pgi_cuda_register_fat_binaryA (fatCubin0=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>, pgi_cuda_loc=0x0) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:157
#1 0x00007fffebf81b33 in __pgi_cuda_register_fat_binary (fatCubin=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:217
#2 0x00007fffedaf5205 in __cudaRegisterLinkedBinary_b140de3b_7_cuda_cu__Z3foov () from ./libshared_acc_rdc.so
#3 0x00007fffedaf5603 in __sti____cudaRegisterAll () at /path/to/tmp/tmpxft_0001a6b4_00000000-6_cuda.cudafe1.stub.c:14
#4 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8 0x00007fffed6dffab in dlopen_doit () from /lib64/libdl.so.2
#9 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffed6e05ad in _dlerror_run () from /lib64/libdl.so.2
#11 0x00007fffed6e0041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#12 0x00000000004011e0 in main ()

which appears to have something to do with the device-code linking performed by nvc++ (?).

A brief summary of other, unsuccessful experiments:

  • Not actually calling foo() from OpenACC makes no difference; it still crashes as above.
  • Adding an explicit device-linking step (nvcc --generate-code=arch=compute_70,code=[compute_70,sm_70] -g -G -Xcompiler=-fPIC -dlink -o dlink.o directives.o cuda.o) and adding the result to the shared library (nvc++ -acc -cuda -gpu=cc70,debug -g -shared -o libshared_acc_rdc.so directives.o cuda.o dlink.o) causes dlopen to fail with undefined symbol: __fatbinwrap_14directives_cpp, which seems consistent with: Separate compilation of mixed CUDA OpenACC code.
  • Disabling relocatable device code (-gpu=nordc for nvc++ and replacing -dc with -dw for nvcc) causes directives.cpp to fail to compile with ptxas fatal : Unresolved extern function '_Z3foov' (even if I append cuda.o to the command line). This experiment was inspired by: problem of openacc compiled shared lib on linux.

Note that the main function that calls dlopen is compiled with g++ , not nvc++ , because of this issue: Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work.

Is there anything else we can try to create a dynamically loadable library containing both OpenACC and CUDA code?


Hi Olli,

Same answer as your first question: compile the C++ code with “-gpu=nordc”. However, this one’s a bit trickier, in that in order to call the CUDA device routine you need RDC enabled so the device linker can resolve the symbol.

What I’d try is merging the cuda.cu file into the directives.cpp file so the routine can be inlined instead of called, thus removing the need for the link step. While not fully supported, nvc++ can compile CUDA code. However, since nvc++ is a single-pass compiler, it cannot support “__CUDA_ARCH__”. Instead we’ve added a constexpr “if target(nv::target::is_device)” construct which can be evaluated at compile time to mimic that behavior.

Normally I’d write an example for you, but my home internet is currently down and I can’t get VPN to work through my tethered cell phone, so I can’t get to my systems. Once back, I’ll post a follow-up with an example. In the meantime, you can see an example I posted HERE.

-Mat

I was able to get VPN to work through my phone.

Here’s an example:

% cat dynamic.cpp
#include <dlfcn.h>
#include <stdexcept>
using launch_t = int(*) ();
int main() {
  void* h = dlopen("./libshared_acc_rdc.so", RTLD_NOW);
  if(!h) { throw std::runtime_error{dlerror()}; }
  auto* launch = reinterpret_cast<launch_t>(dlsym(h, "launch"));
  if(!launch) { throw std::runtime_error{dlerror()}; }
  return launch();
}

% cat directives.cpp
#include <cstdio>
#include <nv/target>

void foo() {
if target(nv::target::is_device) {
  std::printf("hello from device\n");
} else {
  std::printf("hello from host\n");
}
}

void foo();

extern "C" int launch() {
  #pragma acc kernels
  {
    foo();
  }
  return 0;
}

% nvc++ -acc -cuda -gpu=nordc -shared -o libshared_acc_rdc.so directives.cpp -V22.3
% g++ -o main_gcc dynamic.cpp -ldl
% ./main_gcc
hello from device

Hi Mat,

Thanks for the example. I can build and run it locally and get the same results as you, which is good, but I noticed that if I run it under compute-sanitizer or cuda-memcheck then I see various errors/messages like:

$ compute-sanitizer ./main_gcc
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_NOT_FOUND (error 500) due to "named symbol not found" on CUDA API call to cuGetProcAddress.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x217350]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x27460]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame: [0x2e8d7]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame: [0x316c8]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame:__pthread_once_slow [0x620b]
=========                in /lib64/libpthread.so.0
=========     Host Frame: [0x75ce9]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame: [0x23737]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame:cudaGetDevice [0x485c1]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/cuda/11.6/lib64/libcudart.so.11.0
=========     Host Frame:../../src/cuda_init.c:1224:__pgi_uacc_cuda_initdev [0x273df]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacccuda.so
=========     Host Frame:../../src/init.c:554:__pgi_uacc_enumerate [0x46601]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacchost.so
=========     Host Frame:../../src/init.c:635:__pgi_uacc_initialize [0x469fc]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacchost.so
=========     Host Frame:../../src/enter.c:43:__pgi_uacc_enter [0x43a51]
=========                in /gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.3-ukx56m/Linux_x86_64/22.3/compilers/lib/libacchost.so
=========     Host Frame:/gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/mat/directives.cpp:13:launch [0xb332]
=========                in ./libshared_acc_rdc.so
=========     Host Frame:main [0x126b]
=========                in /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/mat/./main_gcc
=========     Host Frame:__libc_start_main [0x22555]
=========                in /lib64/libc.so.6
=========     Host Frame: [0x1119]
=========                in /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/mat/./main_gcc

Is this something that is expected and can safely be ignored?

We can try a few more combinations of CUDA syntax and nvc++ to see if it’s possible to “inline” our usage of Eigen in the way that you suggest; in the past we have only managed to persuade Eigen to work in GPU code with nvcc (e.g. Error using atomics in OpenMP offload region was the result of the most recent effort to use nvc++).

Yes, you can safely ignore this. The OpenACC runtime is testing whether the pointers are CUDA device pointers, which they are not, hence the error from the CUDA runtime.
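
To illustrate what I mean (this is just a hypothetical sketch of that kind of probe, not our actual runtime source), a query like the one below can legitimately come back with a non-success status, for example for a plain host pointer on some CUDA versions, and the caller checks and discards it. compute-sanitizer reports every API call that returns an error, even ones that are expected and recovered from:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper, not NVHPC runtime code: ask the CUDA runtime whether a
// pointer refers to device memory, and treat any error as "no".
static bool is_device_pointer(const void* p) {
  cudaPointerAttributes attr{};
  if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
    // A plain host pointer can make this call report an error; clear the
    // sticky error state so later CUDA API calls are not affected.
    (void)cudaGetLastError();
    return false;
  }
  return attr.type == cudaMemoryTypeDevice;
}

int main() {
  int x = 42;
  std::printf("&x is device memory: %s\n", is_device_pointer(&x) ? "yes" : "no");
  return 0;
}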

Hi Mat,

I have been looking into this again, and have so far not managed to find any way of persuading nvc++ to swallow the Eigen code that we want to execute in a device kernel. nvcc can still compile this code, but I have not been able to avoid the runtime error in my post above:

#0 0x00007fffebf81981 in __pgi_cuda_register_fat_binaryA (fatCubin0=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>, pgi_cuda_loc=0x0) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:157
#1 0x00007fffebf81b33 in __pgi_cuda_register_fat_binary (fatCubin=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:217
#2 0x00007fffedaf5205 in __cudaRegisterLinkedBinary_b140de3b_7_cuda_cu__Z3foov () from ./libshared_acc_rdc.so
#3 0x00007fffedaf5603 in __sti____cudaRegisterAll () at /path/to/tmp/tmpxft_0001a6b4_00000000-6_cuda.cudafe1.stub.c:14
#4 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2

Are there any workarounds I can try here for calling an nvcc-compiled device function from OpenACC code that works in a shared library context? I’ve still mainly been working with 22.3, but I have also spent a little time trying 22.7; it mostly looks the same.

Cheers, Olli

Hi Olli,

Sorry, but this isn’t going to work. As part of the shared object’s device initialization, the runtime needs to register the device code. The registration info is dynamically generated as part of the shared object link step. However, when the code is not compiled by nvc++ (it was compiled by nvcc in this case), this registration info is missing, which causes this runtime error.

The only method I can find to get close to what you want is to use nvc++ to compile the CUDA code, but this means replacing “__CUDA_ARCH__” with “nv/target”.
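
In case it helps, here’s a minimal sketch of what that replacement could look like using the NV_IF_ELSE_TARGET macro from the <nv/target> header that ships with libcu++ (I haven’t tested this exact snippet, so treat it as a sketch rather than a drop-in). Functionally it’s the same foo() as in my earlier example, just spelled with the macro:

#include <cstdio>
#include <nv/target>

// Portable host/device dispatch: the macro takes the first branch when
// compiling for the device and the second branch otherwise.
void foo() {
  NV_IF_ELSE_TARGET(NV_IS_DEVICE, (
    std::printf("hello from device\n");
  ), (
    std::printf("hello from host\n");
  ))
}

Under nvc++ this should behave like the “if target(nv::target::is_device)” form above; if you also wanted to compile the same source with nvcc, you would additionally need to mark foo() as __host__ __device__.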

-Mat

Hi Mat,

Thanks for the reply. When I looked at this before, the issues with using nvc++ directly to compile Eigen code were deeper than the simple __CUDA_ARCH__ and nv::target stuff in the toy example. One relatively simple issue is the static variable we discussed here: Enabling OpenMP offload breaks OpenACC code, where nvcc was able to compile the code but nvc++ in CUDA mode could not. That issue was relatively easy to work around, but other issues followed. For example, if you clone Eigen (libeigen / eigen · GitLab – either the master from a few days ago (b8e93bf) or the latest tag (3.4.0)) and apply:

--- a/Eigen/src/Core/products/GeneralBlockPanelKernel.h
+++ b/Eigen/src/Core/products/GeneralBlockPanelKernel.h
@@ -85,6 +85,15 @@ struct CacheSizes {
 /** \internal */
 inline void manage_caching_sizes(Action action, std::ptrdiff_t* l1, std::ptrdiff_t* l2, std::ptrdiff_t* l3)
 {
+#ifdef EIGEN_GPUCC
+  if(action==GetAction)
+  {
+    eigen_internal_assert(l1!=0 && l2!=0);
+    *l1 =   32 * 1024;
+    *l2 =   64 * 1024;
+    *l3 =           0;
+  }
+#else
   static CacheSizes m_cacheSizes;

   if(action==SetAction)
@@ -102,6 +111,7 @@ inline void manage_caching_sizes(Action action, std::ptrdiff_t* l1, std::ptrdiff
     *l2 = m_cacheSizes.m_l2;
     *l3 = m_cacheSizes.m_l3;
   }
+#endif
   else
   {
     eigen_internal_assert(false);

to avoid the issue in the other forum post linked above, then

#include <Eigen/Dense>
#include <Eigen/LU>

template <int dim>
using MatType = Eigen::Matrix<double, dim, dim, Eigen::ColMajor, dim, dim>;

__device__ void foo() {
  MatType<2> a;
  a.partialPivLu();
}

compiled with

nvc++ -V22.7 -Ieigen -cuda -c test.cpp

gives

nvvmCompileProgram error 9: NVVM_ERROR_COMPILATION.
Error: /tmp/pgaccXedtznI5ly7.gpu (57620, 47): parse use of undefined value '@_ZN5Eigen8internal4pmaxIdEET_RKS2_S4_'
NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (test.cpp: 1)
NVC++/x86-64 Linux 22.7-0: compilation aborted

Fortunately, I was able to find a fairly small set of workarounds (Comparing 3.4.0...olupton/nvc++-cuda · BlueBrain/eigen · GitHub) so that nvc++ in CUDA mode can compile the functions we need for sufficiently small matrices (I believe there are still issues for matrices larger than 16x16, but that is not immediately a problem for us), but it was not exactly straightforward to generate this patch.

I can try to get these workarounds accepted upstream in Eigen, but it seems likely that similar issues will come up again, insofar as community libraries like Eigen are mainly tested with nvcc and nvc++'s CUDA support is different and more limited.

However, for the meantime we have got our application working in shared library mode, at least with OpenACC (OpenMP target offload with nvc++ appears to have more issues, which we have not yet investigated in detail).

One other question: is this the best forum for discussing issues like this with using standard math libraries such as Eigen in offloaded code? Or is there some GPU/math team at NVIDIA that we could get in touch with directly?

Best, Olli

Officially, nvc++ does not support CUDA. Some support is there, but mostly what’s needed to compile Thrust for use with our stdpar implementation. Full CUDA support (sans __CUDA_ARCH__) is a long-term goal, but it won’t be available for a while. Hence it’s not too surprising that there are issues compiling a large library like Eigen.

One other question: is this the best forum for discussing issues like this with using standard math libraries such as Eigen in offloaded code? Or is there some GPU/math team at NVIDIA that we could get in touch with directly?

Eigen isn’t developed by NVIDIA so you’d need to go to the Eigen developers for this.

I can try to help with nvc++ compiler related issues, but not with Eigen issues.

-Mat