Clarification on using OpenACC in a shared library

Hello,

I am reworking an existing standalone C/C++/OpenACC CFD solver as a shared library for inclusion in a larger application. From a high level, it looks like this:
main application, compiled and linked with gcc or PGI.
– geometry, mesh, graphics, etc libraries accessed with dlopen, all compiled with gcc.
– solver library accessed with dlopen, compiled with PGI and OpenACC support.

A first test with OpenACC disabled was successful; the solver library ran as expected. However, when I compile with OpenACC enabled I encounter some issues. The solver starts up as expected and nvidia-smi shows the expected number of MPI processes assigned to the targeted device (set with CUDA_VISIBLE_DEVICES). However, the application hangs at the first acc API call (acc_get_memory() in this case).
I have seen other postings indicating that the -ta=tesla:nordc flag must be used when preparing a shared library. Is this still the recommended solution for PGI 19.X? This is unfortunately a problem for my solver, which is a big C/C++ monster with extensive use of acc routine and global variables.
A fallback would be to link in the solver statically but that would go against the larger design philosophy of this application. I also intend to try inlining but it is going to be a lot of inlined code which makes me a bit wary.

Thanks for your help,
-David

The issue requiring nordc in shared libraries was addressed in PGI 19.1, so if you use any 2019 PGI compiler, you should not have to worry about that.

Can you link with pgc++? We will put the proper init section in, in that case. If not, I think there is still a way. Mat is the expert in this area, unfortunately he is out this week. I’ll do a little checking.

Thanks for the response.

Here is where I currently stand:

If I compile the solver library without nordc and link with pgc++ there is a hang at the first acc_* API call. If I compile and link with pgc++ with nordc the acc_* API calls work. However, the solver library then crashes at the first parallel region. I had to strip out a lot of the code to make this compile at all with nordc, so it is possible I broke something on the way. I’ll continue with this and return to this thread when I have more details.

Could you elaborate on what you mean by “We will put the proper init section in”. Is this somehow related to acc_init()? When you say “link with pgc++”, are you referring to the solver library or the parent application?

-David

I mean the parent application. An init section is something that gets called when the program is loaded, before “main” is called.

Another update:

After some more trial and error I was able to get acc_ API calls and a few simple parallel loops working from the solver library if I do the following:

  • Remove all pragma acc declare statements for global variables.
  • Compile and link the solver library with pgc++ using the nordc option.
  • Link the parent application with pgc++.

The nordc argument seems to be critical, at least for my application. I’m building with PGI 19.4, do you think it is worth upgrading to a newer version?

Moving on to pragma acc routine: is there a way to force pgc++ to inline specific functions via a preprocessing macro? I’m dealing with a complicated make system that makes it messy to force inlining through the -Minline command line arguments.

Thanks,

-David

Yeah, that makes sense. The global data in declare directives is a case of what needs to get set up at program init time, which we can handle when you link with pgc++.

These are user functions you are trying to inline, correct? Not in header files…

Yes, these are user functions/methods. Some defined in header files, which seem to work, and some in source (*.C) files which are giving me the problems.

I should also note that some of these routines are nested.

For example:
I have class A where the parallelism is exposed. Class A hosts a pointer to class B, which in turn calls a utility function implemented elsewhere. A system of create/attach methods offloads everything to the GPU and keeps the pointer tree up to date.

–in file A.C–
#pragma acc parallel loop present(B)
for (i = 1; i < n; i++)
{
  B->foo(i);
}

–in file B.H–
class B
{
  #pragma acc routine seq
  void foo(int i);
};

–in file B.C–
void B::foo(int i)
{
  utility(i);
}

–in file util.H–
#pragma acc routine seq
void utility(int i);

–in file util.C–
void utility(int i)
{

}

As currently configured, I encounter compile errors like:
ptxas fatal : Unresolved extern function '_Z9utilityRKiPKdRS1_RiS4_'

I suppose the path around this would be to first create an inline library with -Mextract, followed by compiling with -Minline=lib:. Is this correct?

If you have “acc routines” marked appropriately, you should not get the unresolved error at link time. Inlining should be an optimization, and not required for correctness.

If you are using our compilers on OpenPOWER, or our LLVM compiler (default since 19.1) on X86, we do not support IPA yet, so -Minline is your only option for inlining across files, as you have discovered.
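For reference, the extract-then-inline flow looks roughly like this (a sketch; util.il and the file names are placeholders, and the exact flag spellings should be checked against the PGI User's Guide):

```shell
# step 1: extract inlinable function bodies into an inline library
pgc++ -Mextract=lib:util.il -c util.C

# step 2: compile the callers, inlining from that library
pgc++ -acc -ta=tesla -Minline=lib:util.il -c A.C
```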

To be clear, the “ptxas fatal : Unresolved extern function” error occurs at compile time, not link time. It is caused by the nordc flag, which is the part I am still not clear on. I have found that “nordc” is still mandatory to get any acc pragmas or API calls to work, which does not seem to match your earlier comments.

I’ll take another look into my build system, perhaps I am missing something.

Like I said, I am not the expert in this area and Mat is out this week, but I did make a minimal reproducer that works and does what I think you are trying to do:

cat main.cpp
#include <stdio.h>

void testit(void);

int
main()
{
testit();
printf("Done\n");
return 0;
}


cat routine.cpp
/* one-D input/output arrays
 * routine seq
 */

#include <stdio.h>
#include "openacc.h"

extern float doit( float* a, int i );
#pragma acc routine(doit) seq

void
testit()
{
float a0[1000], b0[1000];
int i, n = 1000;
for( i = 0; i < n; ++i ) a0[i] = (float)i;
for( i = 0; i < n; ++i ) b0[i] = -1;

#pragma acc parallel copy( b0, a0 )
{
#pragma acc loop
for( i = 0; i < n; ++i ) b0[i] = doit( a0, i );
}

for( i = 0; i < n; ++i ) if (b0[i] != (float)i*(float)i) printf("BAD\n");

}


cat doit.cpp
/* one-D input/output arrays
 * routine seq
 */

#include <stdio.h>
#include "openacc.h"

#pragma acc routine seq
float doit( float* a, int i )
{
return a[i]*a[i];
}


pgc++ -c -ta=tesla -fPIC -Minfo=acc routine.cpp doit.cpp
routine.cpp:
testit():
21, Generating copy(b0[:],a0[:]) [if not already present]
Generating Tesla code
23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
doit.cpp:
doit(float *, int):
10, Generating acc routine seq
Generating Tesla code


pgc++ -ta=tesla -shared -o r.so routine.o doit.o
g++ main.cpp r.so
export LD_LIBRARY_PATH=`pwd`
./a.out
Done

Hi Brendt,

Thanks for taking a look. Indeed, this makes me think there is something unusual in the way our build system works.

For the time being I was able to refactor the code to get things working for what I need right now. I’ll circle back to this in a few months once my current obligations are out of the way.

Thanks again for your help,

-David

Sorry to revive this old thread, but I have a follow up question that is closely related.

I am using OpenACC in a code that will be distributed as a shared library, with both the library and the main executable compiled and linked with PGI. For other projects I have found that it is necessary to use the -nordc option for this to work correctly at runtime. The current solver relies heavily on templated structs containing static member data, similar to the code block below. These constants are accessed in almost every compute-heavy loop.

<in a header file>
template <typename T> struct constants
{
        enum { d = 3 };
        static const T c[d];
};

<in a header implementation file>
template<typename T> 
const T constants<T>::c[constants<T>::d] = { (T)1/(T)3, (T)1/(T)3, (T)1/(T)3 };

#pragma acc declare copyin(constants<float>::c)
#pragma acc declare copyin(constants<double>::c)

<in another header implementation file>
template <typename T, template <typename U> class TC>
void DummyClass<T, TC>::dummyMethod()
{
  <lots of parallel loops that access TC::c>
}

Without the -nordc option I confirm that I am able to offload and access this static data on the device. With -nordc I am able to compile, but the contents of the static data are incorrect during device execution (they seem to be zeroed out). I recall reading somewhere that it is not possible to use the declare pragma across compilation units. Thus, I have two questions:

  1. Is the -nordc flag still required when utilizing OpenACC in a C++ project with both a PGI compiled host application and PGI compiled shared library?

  2. If the answer to #1 is yes, are the limitations around the declare pragma still valid? Is it recommended to declare static data separately in each compilation unit?

Thanks,

-David

Hi David,

  1. Is the -nordc flag still required when utilizing OpenACC in a C++ project with both a PGI compiled host application and PGI compiled shared library?

Late last year, our engineers were able to add RDC support when building Linux shared objects. So while you still couldn’t access these static variables outside of the shared object, within the shared object it should be fine.

Note that you do need to build the shared object with the PGI drivers, using the same OpenACC flags you used to compile, as well as the “-shared” (create a shared object) flag:

pgc++ test.o -acc -ta=tesla -shared -o libmylib.so

Caveat: this support is fairly new and not widely used yet, so there’s possible unknown issues.

  2. If the answer to #1 is yes, are the limitations around the declare pragma still valid? Is it recommended to declare static data separately in each compilation unit?

If using nordc, then you can’t use the “declare” directive nor “routine” where the device routine is not defined within the same file (or module for Fortran). Both require a link step in order to resolve the device symbol names where nordc disables the link.

The added support for RDC in shared objects basically runs the objects through the device linker (nvlink), which is why you need to use PGI when creating the shared object.

-Mat

Hi Mat,

Thanks for the quick response. I’ll give it a try and let you know if it works. If it does work that is great news as I have another project coming down the pipe that is going to be blocked by the same problem.

-David

Hello Mat,

I have a similar use case where I have to build a shared library containing global variables with declare create, and I am seeing confusing behaviour. I prepared the sample below, mimicking the usage in our application, with PGI 19.10:

$ pgc++ --version
 pgc++ 19.10-0 LLVM 64-bit target on x86-64 Linux -tp skylake

Sample :

$ cat test.cpp

#include <stdlib.h>
#include <stdio.h>

float* my_data;
#pragma acc declare create(my_data)

static double my_value = 1.0;
#pragma acc declare copyin(my_value)

double my_celsius;
#pragma acc declare create(my_celsius)

void init(double val) {
    my_data = (float*) malloc(sizeof(float)*3);
    my_data[0] = 1.1;
    #pragma acc enter data copyin(my_data[0:3])

    my_celsius = val;
    #pragma acc update device(my_celsius)
}

#pragma acc routine seq
double foo() {
    double y = 2.0;
    // 2 + 5 + 1 = 8
    return y + my_celsius + my_value;
}

void sample() {
  #pragma acc kernels present(my_celsius)
  {
     double val = foo();
     printf("\n VALUE :: %lf \n", val);
  }
}    

and driver program:

$ cat main.cpp
#include <iostream>

void init(double val);
void sample();

int main() {
    init(5.0);
    sample();
    std::cout << "Done\n";
    return 0;
}

I build this example in different ways:

$ cat build.sh
# Build 1
pgc++ -acc -Minfo test.cpp main.cpp
./a.out

# Build 2
pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
pgc++ -acc -ta=tesla:nordc main.cpp test.so
./a.out

# Build 3
pgc++ -acc -Minfo test.cpp -c -fPIC
pgc++ -acc -shared -o test.so test.o
pgc++ -acc main.cpp test.so
./a.out

and output we get is:

    $ export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH
    $ export PGI_ACC_NOTIFY=2
    $ bash -x build.sh
+ pgc++ -acc -Minfo test.cpp main.cpp
test.cpp:
init(double):
     18, Generating enter data copyin(my_data[:3])
     20, Generating update device(my_celsius)
foo():
     23, Generating acc routine seq
         Generating Tesla code
sample():
     31, Accelerator serial kernel generated
         Generating Tesla code
main.cpp:
+ ./a.out
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=_st__gpfs_bbp_cscs_ch_project_proj16_kumbhar_pramod_scratch_acc_lib_expr_test_cpp_my_value bytes=8
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=.attach. bytes=8
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=my_data bytes=12
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=20 device=0 threadid=1 variable=my_celsius bytes=8

 VALUE :: 8.000000
Done
+ pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
init(double):
     18, Generating enter data copyin(my_data[:3])
     20, Generating update device(my_celsius)
foo():
     23, Generating acc routine seq
         Generating Tesla code
sample():
     31, Accelerator serial kernel generated
         Generating Tesla code
+ pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
+ pgc++ -acc -ta=tesla:nordc main.cpp test.so
main.cpp:
+ ./a.out
upload CUDA data  file=/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp function=_Z4initd line=18 device=0 threadid=1 variable=my_data bytes=12
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x615230 device:0x7fff9fafa000 size:12 presentcount:0+1 line:18 name:my_data
allocated block device:0x7fff9fafa000 size:512 thread:1
FATAL ERROR: data in update device clause was not found on device 1: name=my_celsius
 file:/gpfs/bbp.cscs.ch/project/proj16/kumbhar/pramod_scratch/acc_lib_expr/test.cpp _Z4initd line:20

+ pgc++ -acc -Minfo test.cpp -c -fPIC
init(double):
     18, Generating enter data copyin(my_data[:3])
     20, Generating update device(my_celsius)
foo():
     23, Generating acc routine seq
         Generating Tesla code
sample():
     31, Accelerator serial kernel generated
         Generating Tesla code
+ pgc++ -acc -shared -o test.so test.o
+ pgc++ -acc main.cpp test.so
main.cpp:
+ ./a.out
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found

In summary, this is what I am seeing:

  • Build 1 : Everything works fine if we compile everything as single executable without any library (or create static library)
  • Build 2 : If we create a shared library with nordc then we get the “data in update device clause was not found…” error. Why is that? How do we get around this? If I comment out acc update device(my_celsius) in the init() function then I don’t see this error. But I don’t understand why this works without updating the value on the device. Is that because the global variable gets copied on kernel launch? (I don’t see that with PGI_ACC_DEBUG.) (Edit: as mentioned in the next comment, this doesn’t work.)
  • Build 3 : After reading the previous comments, my impression was that this should work fine without nordc using a newer PGI version. But we get cuModuleGetGlobal returned error 500. Does this mean we always have to use nordc with a shared library?

Using shared library build is critical for our application and it’s difficult to get rid of global variables because the legacy code is auto-generated from some DSL layer.

If you have any suggestions to make Build 2 / Build 3 work, it will be great help!

I read other threads on the forum and they give quite useful information.

It seems like shared library + nordc + global/static variables is a no go! Here is a modified version of the previous example that compiles, but the pointers on the device are NULL. This is quite confusing:

float* my_data;
#pragma acc declare create(my_data)

void init(double val) {
    my_data = (float*) malloc(sizeof(float)*3);
    my_data[0] = 1.1;
    #pragma acc enter data create(my_data[0:3])
}

#pragma acc routine seq
void print() {
    printf("--> data %p\n", my_data);
}

void sample() {
  #pragma acc parallel loop
  for(int i=0; i<3; i++)
  {
        print();
  }
}

and compiling & running gives:

+ pgc++ -acc -ta=tesla:nordc -Minfo ext.cpp -c -fPIC
y_square():
      2, Generating acc routine seq
         Generating Tesla code
+ pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
init(double):
     11, Generating enter data create(my_data[:3])
print():
     14, Generating acc routine seq
         Generating Tesla code
sample():
     18, Generating Tesla code
         20, #pragma acc loop gang, vector(3) /* blockIdx.x threadIdx.x */
+ pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
+ pgc++ -acc -ta=tesla:nordc main.cpp test.so
main.cpp:
+ ./a.out
--> data (nil)
--> data (nil)
--> data (nil)

Why is my_data nil?

(The shared library aspects are quite important for the application. Also, it’s difficult to change global/static variables + routine seq due to legacy DSL layer)

If I have global variables like above, is there any other way (even convoluted) to get this working?

Hi Pramod,

It’s my understanding that the first example should work, however this support is relatively new so there may be issues. I was able to reproduce the error with the latest NVC++ compilers so filed a bug report, TPR #28720, and sent it to our engineers for further review.

-Mat

Hi Mat, Hi Pramod,

I confirm on my end that our application works. We compile both the shared library and the parent executable with the following (without nordc):

-acc -ta=tesla:nofma -Minfo=acc -tp=px -O

The nofma and -O flags were necessary to exactly match GCC CPU results, and it seems to only have a limited impact on performance for this particular application.

However, I tried with another one of our C++ solvers and I encounter problems similar to what Pramod mentioned for case 3. I don’t have time to dig into this deeper right now but I will when I have a chance.

I should add to the chorus that this type of setup (shared library + parent executable) is going to be the norm for all of our GPU accelerated software going forward. Do you have an idea of when TPR 28720 will make it into a release?

Thanks,

-David

Thanks David. My suspicion is that it’s a C++ issue, since I’ve had success, like you, with C, but we’ll need a compiler engineer to dig into the issue before I can say for sure.

Since I just submitted the issue, it’s way too early to know when a fix would be available. I’ll advocate to raise the priority level but that’s not a guarantee that it can be fixed any time soon.

-Mat