We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen'd by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.
We understand from the forums that this is not yet expected to work with OpenMP target offload (see: Creating a shared library that utilises OpenMP offloading), so while we would like to explore OpenMP eventually, we are currently focusing on OpenACC.
I recently posted "Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work" on a similar topic. In that post we have a simple example working where the offloaded kernel is pure OpenACC. Unfortunately, in our real application we also need to call functions defined in CUDA code from OpenACC regions. This is because some external library routines that we use (Eigen, Random123) cannot be reliably used "inline" in OpenACC/OpenMP regions, but are compilable by nvcc for GPU execution.
All of this currently works with nvc++ if we link the various CUDA/OpenACC object files statically into an executable and launch that; the problems arise when we try to build a shared library and either dynamically link it to an executable, or open it with dlopen.
A small standalone example consists of this CUDA code:
$ cat cuda.cu
#include <cstdio>

__host__ __device__ void foo() {
#ifdef __CUDA_ARCH__
    std::printf("hello from device\n");
#else
    std::printf("hello from host\n");
#endif
}
and this OpenACC-annotated C++ code:
$ cat directives.cpp
#pragma acc routine seq
void foo();

extern "C" int launch() {
#pragma acc kernels
    {
        foo();
    }
    return 0;
}
which we build and link into a shared library as follows:
nvc++ -acc -cuda -gpu=cc70,debug -g -fPIC -o directives.o -c directives.cpp
nvcc --generate-code=arch=compute_70,code=[compute_70,sm_70] -g -G -Xcompiler=-fPIC -o cuda.o -dc cuda.cu
nvc++ -acc -cuda -gpu=cc70,debug -g -shared -o libshared_acc_rdc.so directives.o cuda.o
where we are relying on nvc++ to do the device code linking, rather than including an explicit nvcc -dlink step, following the discussion in: Separate compilation of mixed CUDA OpenACC code.
We dynamically load this library with the following executable:
$ cat dynamic.cpp
#include <dlfcn.h>
#include <stdexcept>

using launch_t = int (*)();

int main() {
    void* h = dlopen("./libshared_acc_rdc.so", RTLD_NOW);
    if(!h) { throw std::runtime_error{dlerror()}; }
    auto* launch = reinterpret_cast<launch_t>(dlsym(h, "launch"));
    if(!launch) { throw std::runtime_error{dlerror()}; }
    return launch();
}
which we compile with:
g++ -o main_gcc dynamic.cpp -ldl
(with -ldl placed after the source file, so dlopen/dlsym still resolve on toolchains that default to --as-needed).
All of the above works without any warnings/errors, but actually executing main_gcc segfaults with:
(gdb) bt
#0 0x00007fffebf81981 in __pgi_cuda_register_fat_binaryA (fatCubin0=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>, pgi_cuda_loc=0x0) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:157
#1 0x00007fffebf81b33 in __pgi_cuda_register_fat_binary (fatCubin=0x7fffedaf55ac <__nv_cudaEntityRegisterCallback(void**)>) at /proj/build/21C/Linux_x86_64/rte/cudafor/src/cudaregister.c:217
#2 0x00007fffedaf5205 in __cudaRegisterLinkedBinary_b140de3b_7_cuda_cu__Z3foov () from ./libshared_acc_rdc.so
#3 0x00007fffedaf5603 in __sti____cudaRegisterAll () at /path/to/tmp/tmpxft_0001a6b4_00000000-6_cuda.cudafe1.stub.c:14
#4 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5 0x00007fffed8f759e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7 0x00007fffed8f6b8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8 0x00007fffed6dffab in dlopen_doit () from /lib64/libdl.so.2
#9 0x00007fffed8f27d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffed6e05ad in _dlerror_run () from /lib64/libdl.so.2
#11 0x00007fffed6e0041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#12 0x00000000004011e0 in main ()
which appears to have something to do with the device linking of the CUDA code done by nvc++, though we are not certain.
A brief summary of other, unsuccessful experiments:
- Not actually calling foo() from OpenACC makes no difference; it still crashes as above.
- Adding an explicit device-linking step (nvcc --generate-code=arch=compute_70,code=[compute_70,sm_70] -g -G -Xcompiler=-fPIC -dlink -o dlink.o directives.o cuda.o) and adding the result to the shared library (nvc++ -acc -cuda -gpu=cc70,debug -g -shared -o libshared_acc_rdc.so directives.o cuda.o dlink.o) causes dlopen to fail with: undefined symbol: __fatbinwrap_14directives_cpp, which seems consistent with: Separate compilation of mixed CUDA OpenACC code.
- Disabling relocatable device code (-gpu=nordc for nvc++, and replacing -dc with -dw for nvcc) causes directives.cpp to fail to compile with: ptxas fatal : Unresolved extern function '_Z3foov' (even if I append cuda.o to the command line). This experiment was inspired by: problem of openacc compiled shared lib on linux.
Note that the main function that calls dlopen is compiled with g++, not nvc++, because of this issue: Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work.
Is there anything else we can try to create a dynamically loadable library containing both OpenACC and CUDA code?