Acc_deviceptr does not work in OpenACC code dynamically loaded from a shared library

We are currently working on making the GPU-enabled version of an application dynamically loadable from Python, which implies bundling GPU-enabled code in a shared library that is dlopen’d by an executable (Python) that is not linked against any OpenACC or OpenMP runtime libraries.

When operating in this mode, we are seeing problems with (at least) the acc_deviceptr API routine, which is simply returning the (host) pointer passed to it instead of mapping it to a device pointer.

This is quite easy to reproduce in a standalone example:

$ cat main.cpp
#include <dlfcn.h>
#include <stdexcept>
int main() {
  void* h = dlopen("./", RTLD_NOW);
  if(!h) { throw std::runtime_error{(std::string{"dlopen: "} + dlerror()).c_str()}; }
  auto* openacc = reinterpret_cast<void(*)()>(dlsym(h, "openacc"));
  if(!openacc) { throw std::runtime_error{(std::string{"dlsym: "} + dlerror()).c_str()}; }
  return 0;


$ cat lib.cpp
#include <openacc.h>
#include <iostream>
extern "C" void openacc() {
  constexpr auto data_size = 1024;
  auto* data = new double[data_size];
  auto* d_data = static_cast<double*>(acc_copyin(data, data_size*sizeof(double)));
  std::cout << "data = " << data << ", d_data = " << d_data << ", acc_deviceptr(data) = " << acc_deviceptr(data) << std::endl;

compiled with

nvc++ -acc -gpu=nordc -shared -o lib.cpp
g++ -ldl -o main main.cpp


$ ./main
data = 0x41cca0, d_data = 0x7fff9bafa000, acc_deviceptr(data) = 0x41cca0

i.e. acc_copyin returns a plausible device pointer, but later calling acc_deviceptr returns the host pointer.

I am using nvc++ 22.3, and this does not appear to be correct behaviour. Is this a known issue, or is there any easy workaround? We can try to propagate the return value of acc_copyin manually to where it’s needed, but this may prove to be a challenge in the real application.

Other observations:

  • I included the -gpu=nordc option based on previous replies, but removing this option does not affect the results.
  • I tried a #pragma acc host_data use_device(data) block: this behaves the same way as acc_deviceptr, returning the host address
  • Printing data from a device kernel with #pragma acc kernels present(data) returns the correct device pointer.
  • If I set NVCOMPILER_ACC_NOTIFY=31 and add acc_delete(data, data_size*sizeof(double)) then I see a message like delete CUDA data devaddr=0x7fff9bafa000 bytes=8192 device=0 threadid=1 containing the correct device pointer.

Please let me know if I can provide any other information.

1 Like

Thanks Olli,

This is a known issue (TPR #29016). The good news is that it looks like this has been fixed in our upcoming 22.5 release.

I do need to mention that there is a small chance that testing reveals an issue and the fix needs to be backed out. Hence, I can’t guarantee it will be in the final release. Though, I’ll add a note to TPR #29016 to double check once 22.5 is releases and post an confirmation.

% nvc++ -acc -gpu=nordc -shared -o lib.cpp -V22.3; ./main
data = 0x557372289b50, d_data = 0x14b6992fa000, acc_deviceptr(data) = 0x557372289b50
% nvc++ -acc -gpu=nordc -shared -o lib.cpp -V22.5 ; ./main
data = 0x562fae5eeb50, d_data = 0x1464212fa000, acc_deviceptr(data) = 0x1464212fa000


Hello Mat,

As this is only happening in the case of dynamically loaded shared libraries, I am curious if there is any way / workaround to trigger correct behaviour.

The reason I am asking is that getting 22.5 out and available on different HPC systems might take some time, if there is a workaround then we can implement such in our application temporarily.

Thank you very much again!

Linking with nvc++ instead of g++ is one work around, but that wont help here since you’re calling the library from Python.

The second work around would be to launch a serial kernel to capture the device address. Something like the following. Note the use of “size_t” instead of a pointer which makes it easier to copy back from the GPU.

% cat lib.cpp
#include <openacc.h>
#include <iostream>
extern "C" void openacc() {
  constexpr auto data_size = 1024;
  auto* data = new double[data_size];
  auto* d_data = static_cast<double*>(acc_copyin(data, data_size*sizeof(double)));
  size_t dptr;
#pragma acc serial present(data) copyout(dptr)
     dptr = (size_t) data;
  std::cout << "data = " << data << ", d_data = " << d_data << ", acc_deviceptr(data) = " << acc_deviceptr(data) << std::endl;
  std::cout << "data = " << data << ", d_data = " << d_data << ", dptr = " << (double*)dptr << std::endl;
% nvc++ -acc -gpu=nordc -shared -o lib.cpp -V22.3
% ./main
data = 0x5615ba735b80, d_data = 0x14e0332fa000, acc_deviceptr(data) = 0x5615ba735b80
data = 0x5615ba735b80, d_data = 0x14e0332fa000, dptr = 0x14e0332fa000

Hope this helps,

Hi Olli, Pramod,

FYI, 22.5 was released today and I doubled checked that the fix did get included.