Clarification on using OpenACC in a shared library

Thanks Mat,

I’ll keep you posted if/when I encounter any related issues.

-David

Thank you very much David and Mat for providing quick feedback.

If there is anything I should try or experiment with, let me know.

Just for context, I am integrating GPU support into the NEURON framework, and using static libraries is not straightforward (or at least doesn’t match how a large number of users use this software). So any help or feedback will be highly appreciated.

Dear Mat,

Just to check, was there any response or suggestion from the compiler dev team?

Thank you!

No, sorry, nothing yet.

Hi Mat, I’m just checking in on this issue to see if there has been any progress or news from the compiler dev team. I’ve been tasked with reworking two of our standalone solvers as shared libraries with a very tight deadline. I started today and ran into a bunch of issues with both PGI 19.10 and the latest NVIDIA HPC SDK (20.9). I’ll try to reproduce these issues in a simplified manner tomorrow but hopefully there are already some fixes in the pipeline.

During the OpenACC summit a few months ago I asked about this issue (Michael Wolfe and Mat were there during the breakout session). I am wondering if the compiler dev team has any feedback or a workaround for this.

If there is no ETA for a resolution of this issue, I am wondering if we should give up on the shared library approach and use a static library instead (which is going to impact our user workflows).

Hi Pramod,

The workaround would be to not use RDC (i.e. -gpu=nordc) with shared objects. However, without RDC the code can’t contain any static global references in “declare” directives (such as extern variables), nor can it make cross-file device routine calls. So you’d need to reorganize the code so that “my_data” is passed in as an argument rather than accessed as a global reference.

For example:

% cat test1.cpp
#include <stdlib.h>
#include <stdio.h>

void init(double val, float** my_data, double & my_celsius) {
    *my_data = (float*) malloc(sizeof(float)*3);
    (*my_data)[0] = 1.1;
    float * tmp = *my_data;
    #pragma acc enter data copyin(tmp[0:3])

    my_celsius = val;
    #pragma acc enter data copyin(my_celsius)
}

#pragma acc routine seq
double foo (double my_celsius,  double my_value) {
    double y = 2.0;
    // 2 + 5 + 1 = 8
    return y + my_celsius + my_value;
}

double sample(double & my_celsius,double my_value) {
  double val;
  #pragma acc serial present(my_celsius) copyout(val)
  {
      val = foo(my_celsius, my_value);
  }
  return val;
}
% cat main.cpp
#include <iostream>

void init(double val, float** my_data, double & my_celsius);
double sample(double &my_celsius,double my_value);
static double my_value = 1.0;

int main() {

    float * my_data;
    double my_celsius;
    init(5.0,&my_data,my_celsius);
    double val = sample(my_celsius,my_value);
    std::cout << "Val=" << val << std::endl;
    std::cout << "Done\n";
    return 0;
}
% nvc++ -acc -Minfo -gpu=cc70 test1.cpp -c -fPIC
init(double, float **, double &):
     10, Generating enter data copyin(tmp[:3])
     12, Generating enter data copyin(my_celsius[:1])
foo(double, double):
     15, Generating acc routine seq
         Generating Tesla code
sample(double &, double):
     24, Generating present(my_celsius[:1])
         Generating copyout(val) [if not already present]
         Accelerator serial kernel generated
         Generating Tesla code
% nvc++ -acc -gpu=cc70 -shared -o test.so test1.o
% nvc++ -acc main.cpp test.so -gpu=cc70
main.cpp:
% a.out
Val=8
Done

-Mat

Thanks Mat for providing an example.

In our use case it is currently difficult to change the code structure, because the code is generated by a DSL-to-C++ transpiler and the API is difficult to change. But I will keep this in mind.