Handling global variables in OpenACC kernels

I am working on DSL framework which help to generate OpenACC codes.

I am particulary having problem in accessing a global variables inside the OpenACC kernels which are declared in other files
for example consider a scenario::

  1. main file: contain few global variables (some constants some may not be)
    some value will be assigned to non-constants global inside main function or function is called to set the value to these variable (one call per variable)

  2. master kernel file: this will contain extern declarations for globals and some functions which will be called from main to assign them values. This header file will also contain include statements for files which contain OpenACC kernels (this is like a master kernel file)

  3. other OpenACC kernel files where functions containinng loops with OpenACC constructs

earlier this used to work but recent changes resulting in compilation error for those global variables with “undefined reference error”

so what is the correct way to handle such global variables in OpenACC, any #pragma acc constructs are required

Also please explain how to handle Module variables in Fortran (both constants and variables)

The “declare” directive creates a data region that matches the scoping unit in which it’s declared and primarily used for use with C/C++ global variables and Fortran module variables.

See: OpenACC Getting Started Guide Version 23.7 for ARM, OpenPower, x86

However it’s only required when accessing these variables directly from a device subroutines (i.e. the variable is not passed as an argument). If used in a compute region, you can use structured or unstructured data regions (enter/exit data).

The caveat of using “declare” to access global variables across files is that the device code must use Relocatable Device Code (RDC) and be linked. For libraries is can be problematic since you will need to link the main code with nvc++ or nvfortran since other compilers don’t perform this device link step.

If you are intending to build the library with nvc++/nfortran with OpenACC but use the library with code build by other compilers, then you’ll want to avoid using “declare” and add the flag “-gpu=nordc”. Again, declare is only required for direct access from subroutines called from within device code, so if this is the case here, you’d need to rewrite the code to instead pass these variables are arguments.

Hello Mat,

Sorry for the confusion i created, iam not buliding a library. That library will helps in auto code-generation which can generate OpenACC version.
Let me simlify the problem i am facing

Lets consider a Jacobi example from Nvidia mentioned here

So lets consider the loop mentioned on line number 44 which apply left and right boundary condition.
Lets pick only left boundary condition which will require imax, jmax and PI value
Now if i want to move this loop outside this file and have new source file which will contain only that loop , how can i make sure value of imax, jmax and PI is available without passing them as an argument to that function
So lets say i make file structure something like this

laplace2d.cpp

#include <stdio.h>
#include <math.h>

// imax, jmax and PI in global scope
int imax, jmax;
float pi  = 2.0f * asinf(1.0f);

void left_bndcon(float* arr, int start, int end)

int main() {
   imax = 4096;         jmax = 4096;
   ............................................
   ............................................
  left_bndcon(A, 0, n);
  .............................................
 ..............................................
 left_bndcon(A_new, 1, n);
 ...................................................
 ...................................................
} 

Then i have another master kernel file which will contain

extern int imax;
extern int jmax;
extern float pi;

#include "left_bndcon_kernel.cpp"

and then finally file containing openacc loop

left_bndcon_kernel.cpp

#include <math.h>
void left_bndcon (float *arr, int start, int end)
{
   #pragma acc loop
    for (int j = start, j < end; j++)
       arr[j*imax+ 0] = sin(pi * j / (jmax-1)); 
}

so in this scenario how can i make sure the imax, jmax and pi is properly available on device when it enters left_bndcon loop

====================================================================
similary in Fortran

Lets say i have constant module file

MODULE CONSTANT_PAR
     integer(4) :: imax, jmax
     real(4) :: pi
END MODULE

then main program laplace2d.F90

PROGRAM MAIN
     USE CONSTANT_PAR
     USE left_bcndon_module

     imax = 4096
     jmax = 4096
     pi =  2.0_4 * asinf(1.0_4);
    ..............................
    ..............................
     call left_bndcon(A, start, end)
    ................................
    ................................
END PROGRAM

and our OpenACC kernel file

MODULE left_bndcon_module
       USE CONSTANT_PAR

       contains
       ! the OpenACC loop accessing imax, jmax, and pi

END MODULE

So in these both CPP and Fortran scenario, how can i make sure the Globals are available on device when it comes to executing this function on device without passing them as an argument to functions in which they are accessed

Previosly this used to work, but now it reports error in accessing these globals inside device loop
The documentation says that #pragma acc directive are required
but we dont know which and where to put those

I’m going to assume that you meant to use “parallel loop”, not just “loop”. Loop must be contained within a compute region, otherwise it’s ignored.

void left_bndcon (float *arr, int start, int end)
{
   #pragma acc parallel loop present(arr)
    for (int j = start; j < end; j++)
       arr[j*imax+ 0] = sin(pi * j / (jmax-1)); 
}

Here it doesn’t matter if imax, jmax, and pi are global or not. The symbols are accessible when entering the compute region. Since they are scalars, the default is to implicitly apply firstprivate on them when each parallel thread will get it’s own private copy with each being initialized to the value from the host. Same is true for the Fortran case.

Now if left_bndcon were a device subroutine, meaning that it’s called from within a compute region, then you’ll need to use “declare”.

For example, left_bndcon would look like:

#pragma acc routine vector
void left_bndcon (float *arr, int start, int end)
{
   #pragma acc loop vector
    for (int j = start; j < end; j++)
       arr[j*imax+ 0] = sin(pi * j / (jmax-1)); 
}

and then called in something like:

#pragma acc parallel loop present(arr)
for (int i = 0; i < iter; ++i) {
   left_bndcon (arr, start, end);
}

In this case you’ll need to update the declaration of imax, jmax, pi to be in a declare directive and then update the device after they are assigned on the host:

// imax, jmax and PI in global scope
int imax, jmax;
float pi;
#pragma acc declare create(imax,jmax,pi)
void left_bndcon(float* arr, int start, int end);

int main() {
   float * A;
   int n;
   imax = 16;
   jmax = 16;
   pi  = 2.0f * asinf(1.0f);
#pragma acc update device(imax,jmax,pi)

Now you can still use declare create in the first example, though you’ll want to put the scalars in a “present” clause so the compiler knows to use the global device variable instead of defaulting to use firstprivate, it just not required.

Hello Mat,

Thanks for clarification, i have following questions for the same
So in my case it is the first scenario, use of parallel loop
so in this scenario, lets say if i have both scalars as well as vectors
you mentioned that i need to put
#pragma acc declare create → at the declarations
and #pragma acc update device → after assigining them value

so my questions is, is it required to have them like that only i mean “declare create” near declarations and then “update device” just after its initialization
Or i can keep them anywhere in code, but just before calling the left_bndcon function

so for ex. can i have something like this

// imax, jmax and PI in global scope
int imax, jmax;
float pi;

void left_bndcon(float* arr, int start, int end);

int main() {
   float * A;
   int n;
   imax = 16;
   jmax = 16;
   pi  = 2.0f * asinf(1.0f);
..........................
...........................
#pragma acc declare create(imax,jmax,pi)
#pragma acc update device(imax,jmax,pi)

and then i need to use present clause on loop

void left_bndcon (float *arr, int start, int end)
{
   #pragma acc parallel loop present(arr,imax,jmax,pi)
    for (int j = start; j < end; j++)
       arr[j*imax+ 0] = sin(pi * j / (jmax-1)); 
}

is this correct??
and what about vectors, i need to place them also in present clause??

The declare directive creates a data region having the same scope and lifetime as the scoping unit in which it’s used. Hence I’d recommend using in the same scope (i.e. global) as where the scalars are declared.

Now I’ve not tried putting declare in main using globals, so not 100% what happens, but technically this would be scoped only to main (as well as any subroutines call from within main) and not global scope. When declare is used in global scope, the device variables are created upon load of the binary. Putting it in main delays the creation until entering main. Maybe it would work and our compilers “do the right thing” in this case, but it would not be standard compliant.

and what about vectors, i need to place them also in present clause??

By “vector” are you meaning the vector clause or a pointer to an array (i.e. “arr”)? I’m assuming you mean “arr” here.

By default, the compiler will attempt to implicitly copy the array to the device. However in order to do this, it must be able to determine the size of the array based on the context in which it’s used. Here the code is using a computed index so the compiler would not be unable to determine the size and why in this case, you’ll need to put in a present clause.

Now with just “arr”, the present check is only for the pointer itself, not the array it points to. If you want to check for the full array, you need to add triplet notation, i,e, “arr[0:imax*jmax]”. Triplet notation would be required if you used a copy clause instead of present.

Yes Mat,

  1. I was asking about the static arrays declared in global scope. This was not regarding the code i posted but in general if i encounter such sceario, i wanted to know how those can be properly accessed

so for those i will required
#pragma acc declare create(arr[0:n]) right??

and then i can have them in present or copyin?? when mentioning over loop like mentioning imax, jmax and pi in below example

void left_bndcon (float *arr, int start, int end)
{
   #pragma acc parallel loop present(arr,imax,jmax,pi)
    for (int j = start; j < end; j++)
       arr[j*imax+ 0] = sin(pi * j / (jmax-1)); 
}
  1. Also regarding this #pragma acc declare create, this only goes where this variables are globally declared and not in other files where they are declared as extern right??

  2. And regarding the #pragma update device(var_name), i can have this even lets say in left_bndcon function before start executing the loop right, before #pragma acc parallel loop??

Only if “arr” is a fixed sized array (i.e. not allocated) so the size is known when loading the binary. If n is a variable and “arr” allocated later in the program, then you’ll need to use declare on the pointer, then later after it’s allocated, use an enter data directive to create the memory arr points to.

float *arr;
#pragma acc declare create(arr)    << Create a global device pointer
... later ...

arr = (float*) malloc(size_in_bytes);
#pragma acc enter data create(arr[:n])  << create the device array and "attach" it to the global device pointer

... arr elements get assigned values on the host ...
#pragma acc update device(arr[:n])     << update the values of arr on the device

Fortran is a bit different in that you only need the declare create of the allocatable array not the extra enter data (the update is still needed). Since an “allocate” statement is under the compiler’s control (as opposed to malloc which is a system call), an allocate will create both the host and device copies of the array.

  1. Also regarding this #pragma acc declare create, this only goes where this variables are globally declared and not in other files where they are declared as extern right??

Correct. The symbols should get resolved during the device link step.

  1. And regarding the #pragma update device(var_name), i can have this even lets say in left_bndcon function before start executing the loop right, before #pragma acc parallel loop??

Yes, the update directive can be used anywhere within the host code within the data region.

Though for performance reasons, you should try to limit the data movement between the device and host. Ideally, copy the data to the device once at the start of the program, all accesses to the array are done on the device, then copy the results back to the host at the end. Of course there can be times where it’s necessary to synchronize the host and device memories, but try to avoid this if possible.

Hi Mat,

Thanks for this detailed explanation. This is really helpful

Just a last question

once i have

#pragma acc declare create

for both scalars and static allocated arrays in global scope and then

#pragma acc update device

after their initialization

I will not required to put present or copyin for those variables in parallel loop pragma right???

Putting a variable in a present clause is an independent issue from it’s use in a declare directive. The same issue would also apply for unstructured data regions.

For scalars, the default behavior for a “parallel loop” (i.e. when the variable doesn’t explicitly appear in a clause) is to put these in an implicit firstprivate clause. If you want to change this behavior and use the shared global device copy of the variable, then you’d want to put the variable in a present or a copy clause. Using a firstprivate is likely slightly better for performance, but either way should produce correct results.

Now if you were updating the global scalar on the device, then this would change the behavior since it can’t be privatized. It also creates a dependency and possible race condition, so you’d likely need to add an atomic directive on the assignment.

Another case which changes the default behavior of scalars is when the variable is passed by reference to a device subroutine. Since the subroutine would be defined elsewhere, the compiler wont know if it’s address is taken in the subroutine but it must assume it does since it could. In this scenario you do need to give the compiler more information by either putting the variable in a private/firstprivate clause or a copy/present clause depending on how the variable is being managed.

Arrays are presumed to be shared with the default behavior to create an implicit copy of the array. If the full array is in a visible data region, a declare or structured data region, then no implicit copy will occur. So when using a Fortran allocatable array or a C fixed sized array in a declare, then the compiler should be able to detect this and implicitly add it to a present clause. For a C pointer, only the pointer is visible, not the data it points to. Hence the compiler will still try to implicitly copy it. “present_or” semantics would apply, meaning at runtime it will detect that the array is present and not actually perform the copy. However in order to do an implicit copy, the compiler must know the size to copy which it can usually detect for the context of the loop. For example if you use “arr[i]” where “i” is the loop index variable, it knows it can use the loop trip count for the size. However if the index is computed, “arr[j*imax+i]”, it can’t determine the size so the user must put the array in a data clause (copy, copyin, copyout, or present).

All the implicit rules are really just there for the convivence of the user. Less typing and things to worry about. However, there’s no harm in always being explicit. Given it sounds like you’re developing a method to implicitly it might be easier for you to put all arrays in a data clause. You’d need to keep track of the size of the array for the triplet notation, which might be a challenge. You can add the “default(none)” clause which causes the compiler to not apply implicit data movement and instead requires users to put shared variables in a data clause. Compilation errors is any arrays are missing and might be a good check for you during testing.

Thanks Mat for the thorough explanation.

In my case these globals both scalars as well as fixed size arrays are going to be read-only inside the acc parallel loop.

As here i have a control of how the generated code should be, i think it will be wise choice to mention scalars in firstprivate and fixed size arrays in present clause as you suggested.

Again thank you for all the help and information.

Thanks Mat.

I have done relevant changes and it is working nicely

Hello Mat,

I was doing one experiment where instead of placing #pragma acc update device just above the #pragma acc parallel loop for the globals,
I will call some function from main where those update directives will be placed

I have something like below code structured

main.cpp

#include <stdlib.h>
#include <stdio.h>

int imax, jmax;
#pragma acc declare create(jmax)
double pi  = 2.0 * asin(1.0);
#pragma acc declare create(pi)

int mian()
{
    update_imax();
    update_jmax();
    update_pi();

   // Call to first function containing acc parallel loop

}

master_kernel.cpp

//  global constants
extern int imax;
extern int jmax;
extern double pi;
update_imax() {
#pragma acc update device(imax)
}
//similar function for imax and pi

But for this
I am getting following error

FATAL ERROR: data in update device clause was not found on device 1: name=imax

Considering create is in global scope, this should work right?

if i replace the update_imax() call with #pragma acc update device(imax) it works fine

What’s likely happening is that the device hasn’t been initialized yet so imax doesn’t exist. The device initialization is delayed until the first OpenACC construct is reached.

Now I would have expected upon entry to the update directive that this would have triggered the device initialization, I’ll need to talk to engineering why this doesn’t occur in this case, but I can work around it by putting some OpenACC construct (in this case a serial region) before the call.

For example:

% cat main.cpp
#include <stdlib.h>
#include <stdio.h>
#include <cmath>
#include <openacc.h>
#include <accel.h>
void  update_imax();

int imax, jmax;
#pragma acc declare create(imax,jmax)
double pi  = 2.0 * asin(1.0);
#pragma acc declare create(pi)

int main()
{
#ifdef WORKS
#pragma acc serial
{
  jmax=10;
}
#endif
    imax=11;
    update_imax();
    printf("%d\n",imax);
}
% cat master_kernel.cpp
//  global constants
#include "accel.h"
extern int imax;
extern int jmax;
extern double pi;
void update_imax() {
acc_present_dump();
#pragma acc update device(imax)
}
% nvc++ main.cpp master_kernel.cpp -acc ; a.out
main.cpp:
master_kernel.cpp:
Present table dump
...uninitialized...
Loading 1008 bytes from pgi_cuda_loc
present_search threadid=1 hostptr=0x602150, offset=0, elementsize=4, dims=0, lineno=9, name=imax
check_present(0x602150:0x602154) threadid=1
check_present((nil):(nil)) threadid=1
check_present - threadid=1 find(0x602150:0x602154) returns ((nil):(nil))
FATAL ERROR: data in update device clause was not found on device 1: name=imax
 file:/local/home/mcolgrove/master_kernel.cpp _Z11update_imaxv line:9

% nvc++ main.cpp master_kernel.cpp -acc -DWORKS; a.out
main.cpp:
master_kernel.cpp:
Loading 5344 bytes from pgi_cuda_loc
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 9.0, threadid=1
host:0x604350 device:0x15356aa00100 size:4 presentcount:0+1 line:-1 name:imax file:(null)
host:0x604354 device:0x15356aa00200 size:4 presentcount:0+1 line:-1 name:jmax file:(null)
host:0x604358 device:0x15356aa00300 size:8 presentcount:0+1 line:-1 name:pi file:(null)
present_search threadid=1 hostptr=0x604350, offset=0, elementsize=4, dims=0, lineno=9, name=imax
check_present(0x604350:0x604354) threadid=1
check_present((nil):(nil)) threadid=1
check_present - threadid=1 find(0x604350:0x604354) returns (0x604350:0x604354)
present_search threadid=1 hostptr=0x604350, dims=0, lineno=9, name=imax returns 0x0+0x1
11

Sure Mat,

I will try this.