Dynamically loading an OpenACC-enabled shared library from an executable compiled with nvc++ does not work

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python. This means bundling the GPU-enabled code in a shared library that is dlopen’d by an executable (the Python interpreter) which is not itself linked against any OpenACC or OpenMP runtime libraries.

When working on a minimal example of this using OpenACC and NVC++ 22.2, I came across the following strange behaviour.

If I build a shared library using OpenACC:

% nvc++ -acc -cuda -gpu=cc70,lineinfo -fPIC -o minimal_directives.o -c minimal_directives.cpp
% nvc++ -acc -cuda -gpu=cc70,lineinfo -fPIC -shared -o libshared.so minimal_directives.o
% cat minimal_directives.cpp
extern "C" int launch() {
  int x{3};
  #pragma acc kernels copy(x)
  {
    x = 42;
  }
  return x != 42;
}

And use nvc++ to build a minimal driver executable:

% nvc++ -o main minimal_main.cpp
% cat minimal_main.cpp
#include <dlfcn.h>
#include <stdexcept> // for std::runtime_error
using launch_t = int (*)();
int main() {
  void* h = dlopen("./libshared.so", RTLD_NOW);
  if(!h) { throw std::runtime_error(dlerror()); }
  auto* launch = reinterpret_cast<launch_t>(dlsym(h, "launch"));
  if(!launch) { throw std::runtime_error(dlerror()); }
  return launch();
}

Then the executable aborts:

% ./main
Current file:     /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/minimal_directives.cpp
        function: launch
        line:     4
This file was compiled: -acc=gpu -gpu=cc70
% echo $?
1

But if I build the same code using g++, it works:

% g++ -ldl -o main_gcc minimal_main.cpp
% ./main_gcc
% echo $?
0

The compiler versions are:

% nvc++ --version
nvc++ 22.2-0 64-bit target on x86-64 Linux -tp skylake-avx512
...
% g++ --version
g++ (Spack GCC) 11.2.0
...

but I believe the behaviour is the same with 22.3.

If I remove the -cuda option from the nvc++ invocations, loading the shared library from the main executable works. -cuda is not needed in this minimal example, but in the real application we use some CUDA APIs and call functions defined in CUDA code from OpenACC regions, so we assume that -cuda will be needed. I have not yet managed to produce a dynamically loadable shared library that mixes OpenACC and CUDA code, but that will be a topic for a different thread.
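For reference, a rough sketch of the kind of mixing we have in mind (purely illustrative, not the real application code) is an OpenACC compute region calling a device function defined in a CUDA source file:

% cat cuda_part.cu
// Illustrative: a device function with C linkage, defined in CUDA code.
extern "C" __device__ int scale(int x) { return 2 * x; }
% cat acc_part.cpp
// Illustrative: declare the CUDA function as an OpenACC routine so it can
// be called from device code inside a compute region.
#pragma acc routine seq
extern "C" int scale(int x);

extern "C" int launch_mixed() {
  int x{21};
  #pragma acc kernels copy(x)
  {
    x = scale(x);
  }
  return x != 42;
}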

This behaviour seems surprising to me. Hopefully it will not be a blocker, as Python will typically not be compiled with nvc++, but it seemed worth flagging.

Hi Olli,

Try compiling the code being used in the shared library with “-gpu=nordc”. Relocatable Device Code (RDC) requires the code to be linked using the device linker. While we’ve been able to add a device linking step for shared objects with C and Fortran, this support is not yet available for C++ based shared objects. Disabling RDC will allow it to work as expected.

Note that a few OpenACC features, such as using global variables in a “declare” directive or calling routines across files, require RDC, so you won’t be able to use these features.
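For example, something along these lines (illustrative code only, not from your application) needs RDC in both cases:

// file_a.cpp -- global variable used in a "declare" directive
int table[4] = {1, 2, 3, 4};
#pragma acc declare copyin(table)

#pragma acc routine seq
int lookup(int i) { return table[i]; }

// file_b.cpp -- calling a routine defined in another file from device code
#pragma acc routine seq
int lookup(int i);

extern "C" int use_lookup() {
  int r{0};
  #pragma acc parallel copy(r)
  {
    r = lookup(2);  // cross-file call resolved by the device linker
  }
  return r;
}

With “-gpu=nordc” the device link step is skipped, so constructs like these can’t be resolved.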

Hope this helps,
Mat

Hi Mat,

Thanks for the quick reply! Unfortunately adding the nordc option doesn’t seem to make any difference here.
I am now compiling with:

$ nvc++ -acc -cuda -gpu=cc70 -gpu=nordc -fPIC -o minimal_directives.o -c minimal_directives.cpp
$ nvc++ -acc -cuda -gpu=cc70 -gpu=nordc -fPIC -shared -o libshared.so minimal_directives.o

but building the executable that calls dlopen with nvc++ still produces the same result:

$ nvc++ -o main minimal_main.cpp 
$ ./main
Current file:     /gpfs/bbp.cscs.ch/home/olupton/nvhpc-gpu-code-in-shared-library/minimal_directives.cpp
        function: launch
        line:     4
This file was compiled: -acc=gpu -gpu=cc70
$ echo $?
1

and building it with g++ still works:

$ g++ -ldl -o main_gcc minimal_main.cpp
$ ./main_gcc
$ echo $?
0

Let me know if there’s anything else useful I can try! Note that this is now nvc++ 22.3, which we have since deployed on our system.

Best, Olli

One other clarification about my original message. I said that removing -cuda fixed things, but I now realise that was only partially true. If I build like this, without -cuda:

$ nvc++ -acc -gpu=cc70 -fPIC -o minimal_directives.o -c minimal_directives.cpp
$ nvc++ -acc -gpu=cc70 -fPIC -shared -o libshared.so minimal_directives.o
$ nvc++ -o main minimal_main.cpp

then main does not produce an error message or failure code:

$ ./main
$ echo $?
0

but it does not actually execute on the device either:

$ nvprof ./main
==89590== NVPROF is profiling process 89590, command: ./main
==89590== Profiling application: ./main
==89590== Profiling result:
No kernels were profiled.
No API activities were profiled.
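(As an aside, one could also check from inside the code whether the region actually ran on the device, e.g. with an illustrative sketch like the following using acc_on_device() from openacc.h; nvprof already makes the point, though.)

#include <openacc.h>
extern "C" int launch_check() {
  int on_device{0};
  #pragma acc kernels copy(on_device)
  {
    // Non-zero only if this region actually executes on an accelerator.
    on_device = acc_on_device(acc_device_not_host);
  }
  return on_device ? 0 : 1;
}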

The executable compiled with g++ still works as expected:

$ g++ -ldl -o main_gcc minimal_main.cpp
$ nvprof --openacc-profiling off ./main_gcc
==89759== NVPROF is profiling process 89759, command: ./main_gcc
==89759== Profiling application: ./main_gcc
==89759== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   40.56%  3.2320us         1  3.2320us  3.2320us  3.2320us  launch_4_gpu
                   36.55%  2.9120us         1  2.9120us  2.9120us  2.9120us  [CUDA memcpy DtoH]
                   22.89%  1.8240us         1  1.8240us  1.8240us  1.8240us  [CUDA memset]

Adding -gpu=nordc when creating the shared library does not affect this conclusion.

Best, Olli

Add “-acc” to the nvc++ link and then it will work as expected.

The issue is that when the executable is linked with nvc++, the compiler runtime gets initialized without OpenACC support unless “-acc” is added to the link. Since the executable and the shared library share the same runtime, the OpenACC code fails once the shared library’s OpenACC section is reached. With g++, the runtime initialization is delayed until the shared object is loaded.

% setenv NV_ACC_TIME 1
% nvc++ -acc -cuda -gpu=lineinfo,nordc -fPIC -o minimal_directives.o -c minimal_directives.cpp
% nvc++ -acc -cuda -gpu=lineinfo,nordc -fPIC -shared -o libshared.so minimal_directives.o
% nvc++ -o main minimal_main.cpp
% ./main
Current file:     /local/home/mcolgrove/minimal_directives.cpp
        function: launch
        line:     4
This file was compiled: -acc=gpu -gpu=cc80
% nvc++ -o main minimal_main.cpp -acc
% ./main

Accelerator Kernel Timing data
/local/home/mcolgrove/minimal_directives.cpp
  launch  NVIDIA  devicenum=0
    time(us): 56
    4: compute region reached 1 time
        4: kernel launched 1 time
            grid: [1]  block: [1]
             device time(us): total=5 max=5 min=5 avg=5
            elapsed time(us): total=308 max=308 min=308 avg=308
    4: data region reached 2 times
        4: data copyin transfers: 1
             device time(us): total=8 max=8 min=8 avg=8
        6: data copyout transfers: 1
             device time(us): total=43 max=43 min=43 avg=43
