Clarification on using OpenACC in a shared library

If you have “acc routines” marked appropriately, you should not get the unresolved error at link time. Inlining should be an optimization, and not required for correctness.

If you are using our compilers on OpenPOWER, or our LLVM compiler (default since 19.1) on X86, we do not support IPA yet, so -Minline is your only option for inlining across files, as you have discovered.
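For context (not from the original post), a rough sketch of what that might look like on the command line, reusing the reproducer's file names from below; whether plain -Minline is enough for cross-file inlining or an inline library is also needed depends on the compiler version, so treat this as an assumption:

pgc++ -c -acc -ta=tesla -fPIC -Minline -Minfo=inline,acc routine.cpp doit.cpp
pgc++ -acc -ta=tesla -shared -o r.so routine.o doit.o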

To be clear, the “ptxas fatal : Unresolved extern function” error occurs at compile time, not at link time. It occurs due to the nordc flag, which is the part I am still not quite clear on. I have found that “nordc” is still mandatory to get any acc pragmas or API calls to work; from your earlier comments, this does not seem to be correct.
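For illustration (a hedged sketch, not from the original post): with nordc each translation unit’s device code must be fully resolved on its own, so compiling a file that calls an acc routine defined in a different file fails in ptxas at compile time, roughly like this:

pgc++ -c -acc -ta=tesla:nordc -fPIC routine.cpp    # calls doit(), which is defined in doit.cpp
ptxas fatal : Unresolved extern function '...'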

I’ll take another look into my build system, perhaps I am missing something.

Like I said, I am not the expert in this area and Mat is out this week, but I did make a minimal reproducer that works and does what I think you are trying to do:

cat main.cpp
#include <stdio.h>

void testit(void);

int
main()
{
    testit();
    printf("Done\n");
    return 0;
}


cat routine.cpp
/* one-D input/output arrays
 * routine seq
 */

#include <stdio.h>
#include "openacc.h"

extern float doit( float* a, int i );
#pragma acc routine(doit) seq

void
testit()
{
    float a0[1000], b0[1000];
    int i, n = 1000;
    for( i = 0; i < n; ++i ) a0[i] = (float)i;
    for( i = 0; i < n; ++i ) b0[i] = -1;

    #pragma acc parallel copy( b0, a0 )
    {
        #pragma acc loop
        for( i = 0; i < n; ++i ) b0[i] = doit( a0, i );
    }

    for( i = 0; i < n; ++i ) if (b0[i] != (float(i)*float(i))) printf("BAD\n");
}


cat doit.cpp
/* one-D input/output arrays
 * routine seq
 */

#include <stdio.h>
#include "openacc.h"

#pragma acc routine seq
float doit( float* a, int i )
{
    return a[i]*a[i];
}


pgc++ -c -ta=tesla -fPIC -Minfo=acc routine.cpp doit.cpp
routine.cpp:
testit():
21, Generating copy(b0[:],a0[:]) [if not already present]
Generating Tesla code
23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
doit.cpp:
doit(float *, int):
10, Generating acc routine seq
Generating Tesla code


pgc++ -ta=tesla -shared -o r.so routine.o doit.o
g++ main.cpp r.so
export LD_LIBRARY_PATH=`pwd`
./a.out
Done

Hi Brendt,

Thanks for taking a look. Indeed, this makes me think there is something unusual in the way our build system works.

For the time being I was able to refactor the code to get things working for what I need right now. I’ll circle back to this in a few months once my current obligations are out of the way.

Thanks again for your help,

-David

Sorry to revive this old thread, but I have a follow up question that is closely related.

I am using OpenACC in a code that will be distributed as a shared library, with both the library and the main executable compiled and linked with PGI. For other projects I have found that it is necessary to use the -nordc option for this to work correctly at runtime. The current solver relies heavily on templated structs containing static member data, similar to the code block below. These constants are accessed in almost every compute-heavy loop.

<in a header file>
template <typename T> struct constants
{
        enum { d = 3 };   
        static const T c[d];
};

<in a header implementation file>
template<typename T> 
const T constants<T>::c[constants<T>::d] = { (T)1/(T)3, (T)1/(T)3, (T)1/(T)3 };

#pragma acc declare copyin(constants<float>::c)
#pragma acc declare copyin(constants<double>::c)

<in another header implementation file>
template <typename T, template <typename U> class TC>
void DummyClass<T, TC>::dummyMethod()
{
  <lots of parallel loops that access TC<T>::c>
}

Without the -nordc option I confirm that I am able to offload and access this static data on the device. With -nordc I am able to compile, but the contents of the static data are incorrect during device execution (they seem to be zeroed out). I recall reading somewhere that it is not possible to use the declare pragma across compilation units. Thus, I have two questions:

  1. Is the -nordc flag still required when utilizing OpenACC in a C++ project with both a PGI compiled host application and PGI compiled shared library?

  2. If the answer to #1 is yes, are the limitations around the declare pragma still valid? Is it recommended to declare static data separately in each compilation unit?

Thanks,

-David

Hi David,

  1. Is the -nordc flag still required when utilizing OpenACC in a C++ project with both a PGI compiled host application and PGI compiled shared library?

Late last year, our engineers were able to add RDC support when building Linux shared objects. So while you still can’t access these static variables from outside the shared object, within the shared object it should be fine.

Now, you do need to build the shared object using the PGI drivers, with the same OpenACC flags you use to compile, as well as the “-shared” (create a shared object) flag.

pgc++ test.o -acc -ta=tesla -shared -o libmylib.so

Caveat: this support is fairly new and not widely used yet, so there may be unknown issues.
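For completeness, a hedged sketch of the full sequence (hypothetical file names), roughly matching the builds shown later in the thread:

pgc++ -acc -ta=tesla -fPIC -c test.cpp                # compile with RDC (the default, i.e. no nordc)
pgc++ -acc -ta=tesla -shared -o libmylib.so test.o    # the PGI driver runs the device linker for the .so
pgc++ -acc -ta=tesla main.cpp -L. -lmylib -o a.out    # build the host application against the library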

  2. If the answer to #1 is yes, are the limitations around the declare pragma still valid? Is it recommended to declare static data separately in each compilation unit?

If using nordc, then you can’t use the “declare” directive, nor “routine” where the device routine is not defined within the same file (or module for Fortran). Both require a device link step in order to resolve the device symbol names, and nordc disables that link.
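To illustrate the restriction (a minimal sketch, not from the original posts; the function names are hypothetical):

/* OK with nordc: the device routine is defined in the same file as the
   compute region that calls it */
#pragma acc routine seq
static float square(float x) { return x * x; }

void scale(float* a, int n) {
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) a[i] = square(a[i]);
}

/* Fails with nordc: other_routine() is defined in a different translation
   unit, so resolving it needs the device link step that nordc disables */
#pragma acc routine seq
extern float other_routine(float x);

void scale2(float* a, int n) {
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) a[i] = other_routine(a[i]);
}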

The added support for RDC in shared objects basically runs the objects through the device linker (nvlink), which is why you need to use PGI when creating the shared object.

-Mat

Hi Mat,

Thanks for the quick response. I’ll give it a try and let you know if it works. If it does work that is great news as I have another project coming down the pipe that is going to be blocked by the same problem.

-David

Hello Mat,

I have a similar use case where I have to build a shared library containing global variables with “declare create”, and I am seeing confusing behaviour. I have prepared the sample below, which mimics the usage in our application, with PGI 19.10:

$ pgc++ --version
 pgc++ 19.10-0 LLVM 64-bit target on x86-64 Linux -tp skylake

Sample :

$ cat test.cpp

#include <stdlib.h>
#include <stdio.h>

float* my_data;
#pragma acc declare create(my_data)

static double my_value = 1.0;
#pragma acc declare copyin(my_value)

double my_celsius;
#pragma acc declare create(my_celsius)

void init(double val) {
    my_data = (float*) malloc(sizeof(float)*3);
    my_data[0] = 1.1;
    #pragma acc enter data copyin(my_data[0:3])

    my_celsius = val;
    #pragma acc update device(my_celsius)
}

#pragma acc routine seq
double foo() {
    double y = 2.0;
    // 2 + 5 + 1 = 8
    return y + my_celsius + my_value;
}

void sample() {
  #pragma acc kernels present(my_celsius)
  {
     double val = foo();
     printf("\n VALUE :: %lf \n", val);
  }
}    

and driver program:

$ cat main.cpp
#include <iostream>

void init(double val);
void sample();

int main() {
    init(5.0);
    sample();
    std::cout << "Done\n";
    return 0;
}

I build this example in different ways:

$ cat build.sh
# Build 1
pgc++ -acc -Minfo test.cpp main.cpp
./a.out

# Build 2
pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
pgc++ -acc -ta=tesla:nordc main.cpp test.so
./a.out

# Build 3
pgc++ -acc -Minfo test.cpp -c -fPIC
pgc++ -acc -shared -o test.so test.o
pgc++ -acc main.cpp test.so
./a.out

and the output we get is:

    $ export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH
    $ export PGI_ACC_NOTIFY=2
    $ bash -x build.sh
+ pgc++ -acc -Minfo test.cpp main.cpp
test.cpp:
init(double):
     18, Generating enter data copyin(my_data[:3])
     20, Generating update device(my_celsius)
foo():
     23, Generating acc routine seq
         Generating Tesla code
sample():
     31, Accelerator serial kernel generated
         Generating Tesla code
main.cpp:
+ ./a.out
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=_st__gpfs_bbp_cscs_ch_project_proj16_kumbhar_pramod_scratch_acc_lib_expr_test_cpp_my_value bytes=8
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=my_data bytes=12
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=20 device=0 threadid=1 variable=my_celsius bytes=8

 VALUE :: 8.000000
Done
+ pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
init(double):
     18, Generating enter data copyin(my_data[:3])
     20, Generating update device(my_celsius)
foo():
     23, Generating acc routine seq
         Generating Tesla code
sample():
     31, Accelerator serial kernel generated
         Generating Tesla code
+ pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
+ pgc++ -acc -ta=tesla:nordc main.cpp test.so
main.cpp:
+ ./a.out
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=my_data bytes=12
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x615230 device:0x7fff9fafa000 size:12 presentcount:0+1 line:18 name:my_data
allocated block device:0x7fff9fafa000 size:512 thread:1
FATAL ERROR: data in update device clause was not found on device 1: name=my_celsius
 file:/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp _Z4initd line:20

+ pgc++ -acc -Minfo test.cpp -c -fPIC
init(double):
     18, Generating enter data copyin(my_data[:3])
     20, Generating update device(my_celsius)
foo():
     23, Generating acc routine seq
         Generating Tesla code
sample():
     31, Accelerator serial kernel generated
         Generating Tesla code
+ pgc++ -acc -shared -o test.so test.o
+ pgc++ -acc main.cpp test.so
main.cpp:
+ ./a.out
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found

In summary, this is what I am seeing:

  • Build 1 : Everything works fine if we compile everything as a single executable without any library (or create a static library).
  • Build 2 : If we create a shared library with nordc, then we get the “data in update device clause was not found…” error. Why is that? How do we get around this? If I comment out “acc update device(my_celsius)” in the init() function, then I don’t see this error, but I don’t understand why it works without updating the value on the device. Is that because the global variable gets copied at kernel launch? (I don’t see that with PGI_ACC_DEBUG, though.) (Edit: as mentioned in the next comment, this doesn’t actually work.)
  • Build 3 : After reading the previous comments, my impression was that this should work fine without nordc using a newer PGI version. But we get “cuModuleGetGlobal returned error 500”. Does this mean we always have to use nordc with a shared library?

Using a shared library build is critical for our application, and it’s difficult to get rid of the global variables because the legacy code is auto-generated from a DSL layer.

If you have any suggestions to make Build 2 / Build 3 work, it will be great help!

I read other threads on the forum and they give quite useful information.

It seems like shared library + nordc + global/static variables is a no-go! Here is a modified example from the previous one that compiles, but the pointers on the device are NULL. This is quite confusing:

#include <stdlib.h>
#include <stdio.h>

float* my_data;
#pragma acc declare create(my_data)

void init(double val) {
    my_data = (float*) malloc(sizeof(float)*3);
    my_data[0] = 1.1;
    #pragma acc enter data create(my_data[0:3])
}

#pragma acc routine seq
void print() {
    printf("--> data %p\n", my_data);
}

void sample() {
  #pragma acc parallel loop
  for(int i=0; i<3; i++)
  {
        print();
  }
}

and compiling & running gives:

+ pgc++ -acc -ta=tesla:nordc -Minfo ext.cpp -c -fPIC
y_square():
      2, Generating acc routine seq
         Generating Tesla code
+ pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
init(double):
     11, Generating enter data create(my_data[:3])
print():
     14, Generating acc routine seq
         Generating Tesla code
sample():
     18, Generating Tesla code
         20, #pragma acc loop gang, vector(3) /* blockIdx.x threadIdx.x */
+ pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
+ pgc++ -acc -ta=tesla:nordc main.cpp test.so
main.cpp:
+ ./a.out
--> data (nil)
--> data (nil)
--> data (nil)

my_data being nil?

(The shared library aspects are quite important for the application. Also, it’s difficult to change the global/static variables + routine seq pattern due to the legacy DSL layer.)

If I have global variables like above, is there any other way (even convoluted) to get this working?

Hi Pramod,

It’s my understanding that the first example should work; however, this support is relatively new so there may be issues. I was able to reproduce the error with the latest NVC++ compilers, so I filed a bug report, TPR #28720, and sent it to our engineers for further review.

-Mat

Hi Mat, Hi Pramod,

I confirm on my end that our application works. We compile both the shared library and the parent executable with the following (without nordc):

-acc -ta=tesla:nofma -Minfo=acc -tp=px -O

The nofma and -O flags were necessary to exactly match the GCC CPU results, and they seem to have only a limited impact on performance for this particular application.

However, I tried with another of our C++ solvers and I encountered problems similar to what Pramod mentioned for case 3. I don’t have time to dig into this deeper right now, but I will when I have a chance.

I should add to the chorus that this type of setup (shared library + parent executable) is going to be the norm for all of our GPU accelerated software going forward. Do you have an idea of when TPR 28720 will make it into a release?

Thanks,

-David

Thanks David. My suspicion is that it’s a C++ issue, since, like you, I’ve had success with C, but we’ll need a compiler engineer to dig into the issue before I can say for sure.

Since I just submitted the issue, it’s way too early to know when a fix would be available. I’ll advocate to raise the priority level but that’s not a guarantee that it can be fixed any time soon.

-Mat

Thanks Mat,

I’ll keep you posted if/when I encounter any related issues.

-David

Thank you very much David and Mat for providing quick feedback.

If there is anything I should try or experiment with, let me know.

Just for context, I am integrating GPU support into the NEURON framework, and integrating static libraries is not straightforward (or at least doesn’t match how a large number of users use this software). So any help/feedback will be highly appreciated.

Dear Mat,

Just to check, was there any response/suggestion from the compiler dev team?

Thank you!

No, sorry, nothing yet.

Hi Mat, I’m just checking in on this issue to see if there has been any progress or news from the compiler dev team. I’ve been tasked with reworking two of our standalone solvers as shared libraries with a very tight deadline. I started today and ran into a bunch of issues with both PGI 19.10 and the latest NVIDIA HPC SDK (20.9). I’ll try to reproduce these issues in a simplified manner tomorrow but hopefully there are already some fixes in the pipeline.

During the OpenACC summit a few months ago, I asked about this issue (Michael Wolfe and Mat were there during the breakout session). I am wondering if the compiler dev team has any feedback or a workaround for this.

If there is no ETA for the resolution of this issue, I am wondering if we should give up on the shared library approach and use a static library instead (which is going to impact our user workflows).

Hi Pramod,

The workaround would be to not use RDC (i.e. -gpu=nordc) with shared objects. However, without RDC the code can’t contain any static global references in “declare” directives, such as extern variables, nor does it allow cross-file device routine calls. So you’d need to reorganize the code so “my_data” is passed in as an argument rather than accessed as a global reference.

For example:

% cat test1.cpp
#include <stdlib.h>
#include <stdio.h>

void init(double val, float** my_data, double & my_celsius) {
    *my_data = (float*) malloc(sizeof(float)*3);
    (*my_data)[0] = 1.1;
    float * tmp = *my_data;
    #pragma acc enter data copyin(tmp[0:3])

    my_celsius = val;
    #pragma acc enter data copyin(my_celsius)
}

#pragma acc routine seq
double foo (double my_celsius,  double my_value) {
    double y = 2.0;
    // 2 + 5 + 1 = 8
    return y + my_celsius + my_value;
}

double sample(double & my_celsius,double my_value) {
  double val;
  #pragma acc serial present(my_celsius) copyout(val)
  {
      val = foo(my_celsius, my_value);
  }
  return val;
}
% cat main.cpp
#include <iostream>

void init(double val, float** my_data, double & my_celsius);
double sample(double &my_celsius,double my_value);
static double my_value = 1.0;

int main() {

    float * my_data;
    double my_celsius;
    init(5.0,&my_data,my_celsius);
    double val = sample(my_celsius,my_value);
    std::cout << "Val=" << val << std::endl;
    std::cout << "Done\n";
    return 0;
}
% nvc++ -acc -Minfo -gpu=cc70 test1.cpp -c -fPIC
init(double, float **, double &):
     10, Generating enter data copyin(tmp[:3])
     12, Generating enter data copyin(my_celsius[:1])
foo(double, double):
     15, Generating acc routine seq
         Generating Tesla code
sample(double &, double):
     24, Generating present(my_celsius[:1])
         Generating copyout(val) [if not already present]
         Accelerator serial kernel generated
         Generating Tesla code
% nvc++ -acc -gpu=cc70 -shared -o test.so test1.o
% nvc++ -acc main.cpp test.so -gpu=cc70
main.cpp:
% a.out
Val=8
Done

-Mat

Thanks Mat for providing an example.

In our use case it’s currently difficult to change the code structure, because it’s generated by a DSL-to-C++ transpiler and it’s difficult to change the API. But I will keep this in mind.