Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python. This means bundling the GPU-enabled code in a shared library that is dlopen’d by an executable (the Python interpreter) which is not itself linked against any OpenACC or OpenMP runtime libraries.

When working on a minimal example of this using OpenACC and NVC++ 22.2, I came across the following strange behaviour.

If I build a shared library using OpenACC:

% nvc++ -acc -cuda -gpu=cc70,lineinfo -fPIC -o minimal_directives.o -c minimal_directives.cpp
% nvc++ -acc -cuda -gpu=cc70,lineinfo -fPIC -shared -o libshared.so minimal_directives.o
% cat minimal_directives.cpp
extern "C" int launch() {
  int x{3};
  #pragma acc kernels copy(x)
  {
    x = 42;
  }
  return x != 42;
}

And use nvc++ to build a minimal driver executable:

% nvc++ -o main minimal_main.cpp
% cat minimal_main.cpp
#include <dlfcn.h>
#include <stdexcept> // for std::runtime_error
using launch_t = int (*)();
int main() {
  void* h = dlopen("./libshared.so", RTLD_NOW);
  if(!h) { throw std::runtime_error(dlerror()); }
  auto* launch = reinterpret_cast<launch_t>(dlsym(h, "launch"));
  if(!launch) { throw std::runtime_error(dlerror()); }
  return launch();
}

Then the executable aborts:

% ./main
Current file:     /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/minimal_directives.cpp
        function: launch
        line:     4
This file was compiled: -acc=gpu -gpu=cc70
% echo $?
1

But if I build the same code using g++, it works:

% g++ -ldl -o main_gcc minimal_main.cpp
% ./main_gcc
% echo $?
0

The compiler versions are:

% nvc++ --version
nvc++ 22.2-0 64-bit target on x86-64 Linux -tp skylake-avx512
...
% g++ --version
g++ (Spack GCC) 11.2.0
...

but I believe the behaviour is the same with 22.3.

If I remove the -cuda option from the nvc++ invocations, loading the shared library from the main executable works. -cuda is not needed in this minimal example, but in the real application we use some CUDA APIs and call functions defined in CUDA code from OpenACC regions, so we assume that -cuda will be needed. I have not yet managed to produce a dynamically loadable shared library that mixes OpenACC and CUDA code, but that will be a topic for a different thread.
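For reference, a rough sketch of the kind of mixing we have in mind (purely illustrative, not the real application code) is an OpenACC compute region calling a device function defined in a CUDA source file:

% cat cuda_part.cu
// Illustrative: a device function with C linkage, defined in CUDA code.
extern "C" __device__ int scale(int x) { return 2 * x; }
% cat acc_part.cpp
// Illustrative: declare the CUDA function as an OpenACC routine so it can
// be called from device code inside a compute region.
#pragma acc routine seq
extern "C" int scale(int x);

extern "C" int launch_mixed() {
  int x{21};
  #pragma acc kernels copy(x)
  {
    x = scale(x);
  }
  return x != 42;
}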

This behaviour seems surprising to me. Hopefully it will not be a blocker, as Python will typically not be compiled with nvc++, but it seemed worth flagging.

Hi Olli,

Try compiling the code being used in the shared library with “-gpu=nordc”. Relocatable Device Code (RDC) requires the code to be linked using the device linker. While we’ve been able to add a device linking step for shared objects with C and Fortran, this support is not yet available for C++ based shared objects. Disabling RDC will allow it to work as expected.

Note that a few OpenACC features, such as using global variables in a “declare” directive or calling routines across files, require RDC, so you won’t be able to use these features.
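For example, something along these lines (illustrative code only, not from your application) needs RDC in both cases:

// file_a.cpp -- global variable used in a "declare" directive
int table[4] = {1, 2, 3, 4};
#pragma acc declare copyin(table)

#pragma acc routine seq
int lookup(int i) { return table[i]; }

// file_b.cpp -- calling a routine defined in another file from device code
#pragma acc routine seq
int lookup(int i);

extern "C" int use_lookup() {
  int r{0};
  #pragma acc parallel copy(r)
  {
    r = lookup(2);  // cross-file call resolved by the device linker
  }
  return r;
}

With “-gpu=nordc” the device link step is skipped, so constructs like these can’t be resolved.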

Hope this helps,
Mat

Hi Mat,

Thanks for the quick reply! Unfortunately adding the nordc option doesn’t seem to make any difference here.
I am now compiling with:

$ nvc++ -acc -cuda -gpu=cc70 -gpu=nordc -fPIC -o minimal_directives.o -c minimal_directives.cpp
$ nvc++ -acc -cuda -gpu=cc70 -gpu=nordc -fPIC -shared -o libshared.so minimal_directives.o

but building the executable that calls dlopen with nvc++ still produces the same result:

$ nvc++ -o main minimal_main.cpp 
$ ./main
Current file:     /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/minimal_directives.cpp
        function: launch
        line:     4
This file was compiled: -acc=gpu -gpu=cc70
$ echo $?
1

and building it with g++ still works:

$ g++ -ldl -o main_gcc minimal_main.cpp
$ ./main_gcc
$ echo $?
0

Let me know if there’s anything else useful I can try! Note that this is now nvc++ 22.3, which we have since deployed on our system.

Best, Olli

One other clarification about my original message. I said that removing -cuda fixed things, but I now realise that was only partially true. If I build like this, without -cuda:

$ nvc++ -acc -gpu=cc70 -fPIC -o minimal_directives.o -c minimal_directives.cpp
$ nvc++ -acc -gpu=cc70 -fPIC -shared -o libshared.so minimal_directives.o
$ nvc++ -o main minimal_main.cpp

then main does not produce an error message or failure code:

$ ./main
$ echo $?
0

but it does not actually execute on the device either:

$ nvprof ./main
==89590== NVPROF is profiling process 89590, command: ./main
==89590== Profiling application: ./main
==89590== Profiling result:
No kernels were profiled.
No API activities were profiled.
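(As an aside, one could also check from inside the code whether the region actually ran on the device, e.g. with an illustrative sketch like the following using acc_on_device() from openacc.h; nvprof already makes the point, though.)

#include <openacc.h>
extern "C" int launch_check() {
  int on_device{0};
  #pragma acc kernels copy(on_device)
  {
    // Non-zero only if this region actually executes on an accelerator.
    on_device = acc_on_device(acc_device_not_host);
  }
  return on_device ? 0 : 1;
}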

The executable compiled with g++ still works as expected:

$ g++ -ldl -o main_gcc minimal_main.cpp
$ nvprof --openacc-profiling off ./main_gcc
==89759== NVPROF is profiling process 89759, command: ./main_gcc
==89759== Profiling application: ./main_gcc
==89759== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   40.56%  3.2320us         1  3.2320us  3.2320us  3.2320us  launch_4_gpu
                   36.55%  2.9120us         1  2.9120us  2.9120us  2.9120us  [CUDA memcpy DtoH]
                   22.89%  1.8240us         1  1.8240us  1.8240us  1.8240us  [CUDA memset]

Adding -gpu=nordc when creating the shared library does not affect this conclusion.

Best, Olli

Add “-acc” to the nvc++ link and then it will work as expected.

The issue is that when the executable is linked with nvc++, the compiler runtime gets initialized without OpenACC support unless “-acc” is added to the link. Since the executable and the shared library share the same runtime, the OpenACC code fails once the shared library’s OpenACC section is reached. With g++, the runtime initialization is delayed until the shared object is loaded.

% setenv NV_ACC_TIME 1
% nvc++ -acc -cuda -gpu=lineinfo,nordc -fPIC -o minimal_directives.o -c minimal_directives.cpp
% nvc++ -acc -cuda -gpu=lineinfo,nordc -fPIC -shared -o libshared.so minimal_directives.o
% nvc++ -o main minimal_main.cpp
% ./main
Current file:     /local/home/mcolgrove/minimal_directives.cpp
        function: launch
        line:     4
This file was compiled: -acc=gpu -gpu=cc80
% nvc++ -o main minimal_main.cpp -acc
% ./main

Accelerator Kernel Timing data
/local/home/mcolgrove/minimal_directives.cpp
  launch  NVIDIA  devicenum=0
    time(us): 56
    4: compute region reached 1 time
        4: kernel launched 1 time
            grid: [1]  block: [1]
             device time(us): total=5 max=5 min=5 avg=5
            elapsed time(us): total=308 max=308 min=308 avg=308
    4: data region reached 2 times
        4: data copyin transfers: 1
             device time(us): total=8 max=8 min=8 avg=8
        6: data copyout transfers: 1
             device time(us): total=43 max=43 min=43 avg=43
