Clarification on using OpenACC in a shared library

Hello,

I am reworking an existing standalone C/C++/OpenACC CFD solver as a shared library for inclusion in a larger application. From a high level, it looks like this:
main application, compiled and linked with gcc or PGI.
– geometry, mesh, graphics, etc. libraries accessed with dlopen, all compiled with gcc.
– solver library accessed with dlopen, compiled with PGI and OpenACC support.

A first test with OpenACC disabled was successful: the solver library ran as expected. However, when I compile with OpenACC enabled I encounter some issues. The solver starts up as expected, and nvidia-smi shows the expected number of MPI processes assigned to the targeted device (set with CUDA_VISIBLE_DEVICES), but the application hangs at the first acc API call (acc_get_memory() in this case).
I have seen other postings indicating that the -ta=tesla:nordc flag must be used when preparing a shared library. Is this still the recommended solution for PGI 19.X? This is unfortunately a problem for my solver, which is a big C/C++ monster with extensive use of acc routine and global variables.
A fallback would be to link the solver in statically, but that would go against the larger design philosophy of this application. I also intend to try inlining, but it is going to be a lot of inlined code, which makes me a bit wary.

Thanks for your help,
-David

The issue requiring nordc in shared libraries was addressed in PGI 19.1, so if you use any 2019 PGI compiler, you should not have to worry about that.

Can you link with pgc++? In that case, we will put the proper init section in. If not, I think there is still a way. Mat is the expert in this area, but unfortunately he is out this week. I’ll do a little checking.

Thanks for the response.

Here is where I currently stand:

If I compile the solver library without nordc and link with pgc++, there is a hang at the first acc_* API call. If I compile and link with pgc++ with nordc, the acc_* API calls work. However, the solver library then crashes at the first parallel region. I had to strip out a lot of the code to make this compile at all with nordc, so it is possible I broke something along the way. I’ll continue with this and return to this thread when I have more details.

Could you elaborate on what you mean by “We will put the proper init section in”? Is this somehow related to acc_init()? When you say “link with pgc++”, are you referring to the solver library or the parent application?

-David

I mean the parent application. An init section is something that gets called when the program is loaded, before “main” is called.
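
Conceptually, it behaves like a constructor function that the loader runs automatically before main. As a rough illustration only (this is not necessarily the exact mechanism pgc++ generates):

#include "openacc.h"

// Illustration of an init section: a function arranged to run automatically
// before main(), doing the kind of runtime setup described above.
__attribute__((constructor))
static void acc_startup(void)
{
    acc_init(acc_device_default);   // bring up the OpenACC runtime before main() runs
}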

Another update:

After some more trial and error I was able to get acc_* API calls and a few simple parallel loops working from the solver library if I do the following (rough build commands are sketched after the list):

  • Remove all pragma acc declare statements for global variables
  • Compile and link the solver library with pgc++ using the nordc option.
  • Link the parent application with pgc++
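
For reference, the build steps are roughly the following (file names are placeholders; flags beyond -ta=tesla:nordc, -fPIC, and -shared will vary by project):

pgc++ -c -acc -ta=tesla:nordc -fPIC solver.C -o solver.o      # compile solver sources with OpenACC, RDC disabled
pgc++ -acc -ta=tesla:nordc -shared solver.o -o libsolver.so   # link the solver shared library with pgc++
pgc++ main.o -o app -ldl                                      # link the parent application (which dlopens the library) with pgc++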

The nordc argument seems to be critical, at least for my application. I’m building with PGI 19.4; do you think it is worth upgrading to a newer version?

Moving on to pragma acc routine: is there a way to force pgc++ to inline specific functions via a preprocessing macro? I’m dealing with a complicated make system that makes it messy to force inlining through the -Minline command line arguments.

Thanks,

-David

Yeah, that makes sense. The global data in declare directives is a case of what needs to get set up at program init time, which we can handle when you link with pgc++.

These are user functions you are trying to inline, correct? Not in header files…

Yes, these are user functions/methods. Some are defined in header files, which seem to work, and some in source (*.C) files, which are giving me the problems.

I should also note that some of these routines are nested.

For example:
I have class A, where the parallelism is exposed. Class A holds a pointer to class B, which in turn calls a utility function implemented elsewhere. A system of create/attach methods offloads everything to the GPU and keeps the pointer tree up to date (a rough sketch of that step follows the file excerpts below).

–in file A.C–
#pragma acc parallel loop present(B)
for (i = 1; i < n; i++)
{
  B->foo(i);
}

–in file B.H–
class B
{
  #pragma acc routine seq
  void foo(int i);
};

–in file B.C–
void B::foo(int i)
{
  utility(i);
}

–in file util.H–
#pragma acc routine seq
void utility(int i);

–in file util.C–
void utility(int i)
{

}
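
The create/attach bookkeeping looks roughly like this (the method name createDevice is a placeholder; the real code is more involved):

–sketch, in file A.C–
void A::createDevice()
{
  // copy the A object itself to the device ("openacc.h" is needed for acc_attach)
  #pragma acc enter data copyin(this[0:1])
  // offload the pointed-to B object; its own create method does the same for its members
  B->createDevice();
  // point the device copy of the member pointer A::B at the device copy of *B
  acc_attach((void**)&B);
}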

As currently configured, I encounter compile errors like:
ptxas fatal : Unresolved extern function '_Z9utilityRKiPKdRS1_RiS4_'

I suppose the path around this would be to first create an inline library with -Mextract, followed by compiling with -Minline=lib:. Is this correct?

If you have “acc routines” marked appropriately, you should not get the unresolved error at link time. Inlining should be an optimization, and not required for correctness.

If you are using our compilers on OpenPOWER, or our LLVM compiler (default since 19.1) on X86, we do not support IPA yet, so -Minline is your only option for inlining across files, as you have discovered.
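
For completeness, the -Mextract/-Minline=lib flow you mention would look roughly like this (the inline-library name util.il is just a placeholder):

pgc++ -Mextract=lib:util.il util.C                        # extract inlinable functions into an inline library
pgc++ -c -acc -ta=tesla:nordc -Minline=lib:util.il B.C    # inline them at their call sites when compiling B.C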

To be clear, the “ptxas fatal : Unresolved extern function” error occurs at compile time, not at link time. It happens because of the nordc flag, which is the part I am still not clear on: I have found that nordc is still mandatory to get any acc pragmas or API calls to work, but from your earlier comments it sounds like that should not be the case.

I’ll take another look into my build system, perhaps I am missing something.

Like I said, I am not the expert in this area and Mat is out this week, but I did make a minimal reproducer that works and that I think matches what you are trying to do:

cat main.cpp
#include <stdio.h>

void testit(void);

int
main()
{
testit();
printf("Done\n");
return 0;
}


cat routine.cpp
/* one-D input/output arrays
 * routine seq
 */

#include <stdio.h>
#include "openacc.h"

extern float doit( float* a, int i );
#pragma acc routine(doit) seq

void
testit()
{
float a0[1000], b0[1000];
int i, n = 1000;
for( i = 0; i < n; ++i ) a0[i] = (float)i;
for( i = 0; i < n; ++i ) b0[i] = -1;

#pragma acc parallel copy( b0, a0 )
{
#pragma acc loop
for( i = 0; i < n; ++i ) b0[i] = doit( a0, i );
}

for( i = 0; i < n; ++i ) if (b0[i] != (float(i)*float(i))) printf("BAD\n");

}


cat doit.cpp
/* one-D input/output arrays
 * routine seq
 */

#include <stdio.h>
#include "openacc.h"

#pragma acc routine seq
float doit( float* a, int i )
{
return a[i]*a[i];
}

pgc++ -c -ta=tesla -fPIC -Minfo=acc routine.cpp doit.cpp
routine.cpp:
testit():
21, Generating copy(b0[:],a0[:]) [if not already present]
Generating Tesla code
23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
doit.cpp:
doit(float *, int):
10, Generating acc routine seq
Generating Tesla code


pgc++ -ta=tesla -shared -o r.so routine.o doit.o
g++ main.cpp r.so
export LD_LIBRARY_PATH=`pwd`
./a.out
Done

Hi Brendt,

Thanks for taking a look. Indeed, this makes me think there is something unusual in the way our build system works.

For the time being I was able to refactor the code to get things working for what I need right now. I’ll circle back to this in a few months once my current obligations are out of the way.

Thanks again for your help,

-David

Sorry to revive this old thread, but I have a follow-up question that is closely related.

I am using OpenACC in a code that will be distributed as a shared library, with both the library and the main executable compiled and linked with PGI. For other projects I have found that it is necessary to use the -nordc option for this to work correctly at runtime. The current solver relies heavily on templated structs containing static member data, similar to the code block below. These constants are accessed in almost every compute-heavy loop.

<in a header file>
template <typename T> struct constants
{
    enum { d = 3 };
    static const T c[d];
};

<in a header implementation file>
template<typename T> 
const T constants<T>::c[constants<T>::d] = { (T)1/(T)3, (T)1/(T)3, (T)1/(T)3 };

#pragma acc declare copyin(constants<float>::c)
#pragma acc declare copyin(constants<double>::c)

<in another header implementation file>
template <typename T, template <typename U> class TC>
void DummyClass<T, TC>::dummyMethod()
{
  <lots of parallel loops that access TC::c>
}

Without the -nordc option I confirm that I am able to offload and access this static data on the device. With -nordc I am able to compile, but the contents of the static data are incorrect during device execution (they appear to be zeroed out). I recall reading somewhere that it is not possible to use the declare pragma across compilation units. Thus, I have two questions:

  1. Is the -nordc flag still required when utilizing OpenACC in a C++ project with both a PGI compiled host application and PGI compiled shared library?

  2. If the answer to #1 is yes, are the limitations around the declare pragma still valid? Is it recommended to declare static data separately in each compilation unit?

Thanks,

-David

Hi David,

  1. Is the -nordc flag still required when utilizing OpenACC in a C++ project with both a PGI compiled host application and PGI compiled shared library?

Late last year, our engineers were able to add RDC support when building Linux shared objects. So while you still can’t access these static variables from outside the shared object, within the shared object it should be fine.

Now, you do need to build the shared object with the PGI drivers, using the same OpenACC flags you use to compile, as well as the “-shared” (create a shared object) flag.

pgc++ test.o -acc -ta=tesla -shared -o libmylib.so

Caveat: this support is fairly new and not widely used yet, so there may be unknown issues.

  2. If the answer to #1 is yes, are the limitations around the declare pragma still valid? Is it recommended to declare static data separately in each compilation unit?

If using nordc, then you can’t use the “declare” directive, nor “routine” when the device routine is not defined within the same file (or module for Fortran). Both require a device link step in order to resolve the device symbol names, and nordc disables that link.

The added support for RDC in shared objects basically runs the objects through the device linker (nvlink), which is why you need to use PGI when creating the shared object.

-Mat

Hi Mat,

Thanks for the quick response. I’ll give it a try and let you know if it works. If it does work that is great news as I have another project coming down the pipe that is going to be blocked by the same problem.

-David