I am working on DSL framework which help to generate OpenACC codes.
I am particulary having problem in accessing a global variables inside the OpenACC kernels which are declared in other files
for example consider a scenario::
main file: contain few global variables (some constants some may not be)
some value will be assigned to non-constants global inside main function or function is called to set the value to these variable (one call per variable)
master kernel file: this will contain extern declarations for globals and some functions which will be called from main to assign them values. This header file will also contain include statements for files which contain OpenACC kernels (this is like a master kernel file)
other OpenACC kernel files where functions containinng loops with OpenACC constructs
earlier this used to work but recent changes resulting in compilation error for those global variables with “undefined reference error”
so what is the correct way to handle such global variables in OpenACC, any #pragma acc constructs are required
Also please explain how to handle Module variables in Fortran (both constants and variables)
The “declare” directive creates a data region that matches the scoping unit in which it’s declared and primarily used for use with C/C++ global variables and Fortran module variables.
However it’s only required when accessing these variables directly from a device subroutines (i.e. the variable is not passed as an argument). If used in a compute region, you can use structured or unstructured data regions (enter/exit data).
The caveat of using “declare” to access global variables across files is that the device code must use Relocatable Device Code (RDC) and be linked. For libraries is can be problematic since you will need to link the main code with nvc++ or nvfortran since other compilers don’t perform this device link step.
If you are intending to build the library with nvc++/nfortran with OpenACC but use the library with code build by other compilers, then you’ll want to avoid using “declare” and add the flag “-gpu=nordc”. Again, declare is only required for direct access from subroutines called from within device code, so if this is the case here, you’d need to rewrite the code to instead pass these variables are arguments.
Sorry for the confusion i created, iam not buliding a library. That library will helps in auto code-generation which can generate OpenACC version.
Let me simlify the problem i am facing
Lets consider a Jacobi example from Nvidia mentioned here
So lets consider the loop mentioned on line number 44 which apply left and right boundary condition.
Lets pick only left boundary condition which will require imax, jmax and PI value
Now if i want to move this loop outside this file and have new source file which will contain only that loop , how can i make sure value of imax, jmax and PI is available without passing them as an argument to that function
So lets say i make file structure something like this
laplace2d.cpp
#include <stdio.h>
#include <math.h>
// imax, jmax and PI in global scope
int imax, jmax;
float pi = 2.0f * asinf(1.0f);
void left_bndcon(float* arr, int start, int end)
int main() {
imax = 4096; jmax = 4096;
............................................
............................................
left_bndcon(A, 0, n);
.............................................
..............................................
left_bndcon(A_new, 1, n);
...................................................
...................................................
}
Then i have another master kernel file which will contain
extern int imax;
extern int jmax;
extern float pi;
#include "left_bndcon_kernel.cpp"
so in this scenario how can i make sure the imax, jmax and pi is properly available on device when it enters left_bndcon loop
====================================================================
similary in Fortran
Lets say i have constant module file
MODULE CONSTANT_PAR
integer(4) :: imax, jmax
real(4) :: pi
END MODULE
then main program laplace2d.F90
PROGRAM MAIN
USE CONSTANT_PAR
USE left_bcndon_module
imax = 4096
jmax = 4096
pi = 2.0_4 * asinf(1.0_4);
..............................
..............................
call left_bndcon(A, start, end)
................................
................................
END PROGRAM
and our OpenACC kernel file
MODULE left_bndcon_module
USE CONSTANT_PAR
contains
! the OpenACC loop accessing imax, jmax, and pi
END MODULE
So in these both CPP and Fortran scenario, how can i make sure the Globals are available on device when it comes to executing this function on device without passing them as an argument to functions in which they are accessed
Previosly this used to work, but now it reports error in accessing these globals inside device loop
The documentation says that #pragma acc directive are required
but we dont know which and where to put those
Here it doesn’t matter if imax, jmax, and pi are global or not. The symbols are accessible when entering the compute region. Since they are scalars, the default is to implicitly apply firstprivate on them when each parallel thread will get it’s own private copy with each being initialized to the value from the host. Same is true for the Fortran case.
Now if left_bndcon were a device subroutine, meaning that it’s called from within a compute region, then you’ll need to use “declare”.
#pragma acc parallel loop present(arr)
for (int i = 0; i < iter; ++i) {
left_bndcon (arr, start, end);
}
In this case you’ll need to update the declaration of imax, jmax, pi to be in a declare directive and then update the device after they are assigned on the host:
// imax, jmax and PI in global scope
int imax, jmax;
float pi;
#pragma acc declare create(imax,jmax,pi)
void left_bndcon(float* arr, int start, int end);
int main() {
float * A;
int n;
imax = 16;
jmax = 16;
pi = 2.0f * asinf(1.0f);
#pragma acc update device(imax,jmax,pi)
Now you can still use declare create in the first example, though you’ll want to put the scalars in a “present” clause so the compiler knows to use the global device variable instead of defaulting to use firstprivate, it just not required.
Thanks for clarification, i have following questions for the same
So in my case it is the first scenario, use of parallel loop
so in this scenario, lets say if i have both scalars as well as vectors
you mentioned that i need to put #pragma acc declare create → at the declarations
and #pragma acc update device → after assigining them value
so my questions is, is it required to have them like that only i mean “declare create” near declarations and then “update device” just after its initialization
Or i can keep them anywhere in code, but just before calling the left_bndcon function
so for ex. can i have something like this
// imax, jmax and PI in global scope
int imax, jmax;
float pi;
void left_bndcon(float* arr, int start, int end);
int main() {
float * A;
int n;
imax = 16;
jmax = 16;
pi = 2.0f * asinf(1.0f);
..........................
...........................
#pragma acc declare create(imax,jmax,pi)
#pragma acc update device(imax,jmax,pi)
The declare directive creates a data region having the same scope and lifetime as the scoping unit in which it’s used. Hence I’d recommend using in the same scope (i.e. global) as where the scalars are declared.
Now I’ve not tried putting declare in main using globals, so not 100% what happens, but technically this would be scoped only to main (as well as any subroutines call from within main) and not global scope. When declare is used in global scope, the device variables are created upon load of the binary. Putting it in main delays the creation until entering main. Maybe it would work and our compilers “do the right thing” in this case, but it would not be standard compliant.
and what about vectors, i need to place them also in present clause??
By “vector” are you meaning the vector clause or a pointer to an array (i.e. “arr”)? I’m assuming you mean “arr” here.
By default, the compiler will attempt to implicitly copy the array to the device. However in order to do this, it must be able to determine the size of the array based on the context in which it’s used. Here the code is using a computed index so the compiler would not be unable to determine the size and why in this case, you’ll need to put in a present clause.
Now with just “arr”, the present check is only for the pointer itself, not the array it points to. If you want to check for the full array, you need to add triplet notation, i,e, “arr[0:imax*jmax]”. Triplet notation would be required if you used a copy clause instead of present.
I was asking about the static arrays declared in global scope. This was not regarding the code i posted but in general if i encounter such sceario, i wanted to know how those can be properly accessed
so for those i will required #pragma acc declare create(arr[0:n]) right??
and then i can have them in present or copyin?? when mentioning over loop like mentioning imax, jmax and pi in below example
Also regarding this #pragma acc declare create, this only goes where this variables are globally declared and not in other files where they are declared as extern right??
And regarding the #pragma update device(var_name), i can have this even lets say in left_bndcon function before start executing the loop right, before #pragma acc parallel loop??
Only if “arr” is a fixed sized array (i.e. not allocated) so the size is known when loading the binary. If n is a variable and “arr” allocated later in the program, then you’ll need to use declare on the pointer, then later after it’s allocated, use an enter data directive to create the memory arr points to.
float *arr;
#pragma acc declare create(arr) << Create a global device pointer
... later ...
arr = (float*) malloc(size_in_bytes);
#pragma acc enter data create(arr[:n]) << create the device array and "attach" it to the global device pointer
... arr elements get assigned values on the host ...
#pragma acc update device(arr[:n]) << update the values of arr on the device
Fortran is a bit different in that you only need the declare create of the allocatable array not the extra enter data (the update is still needed). Since an “allocate” statement is under the compiler’s control (as opposed to malloc which is a system call), an allocate will create both the host and device copies of the array.
Also regarding this #pragma acc declare create, this only goes where this variables are globally declared and not in other files where they are declared as extern right??
Correct. The symbols should get resolved during the device link step.
And regarding the #pragma update device(var_name), i can have this even lets say in left_bndcon function before start executing the loop right, before #pragma acc parallel loop??
Yes, the update directive can be used anywhere within the host code within the data region.
Though for performance reasons, you should try to limit the data movement between the device and host. Ideally, copy the data to the device once at the start of the program, all accesses to the array are done on the device, then copy the results back to the host at the end. Of course there can be times where it’s necessary to synchronize the host and device memories, but try to avoid this if possible.
Putting a variable in a present clause is an independent issue from it’s use in a declare directive. The same issue would also apply for unstructured data regions.
For scalars, the default behavior for a “parallel loop” (i.e. when the variable doesn’t explicitly appear in a clause) is to put these in an implicit firstprivate clause. If you want to change this behavior and use the shared global device copy of the variable, then you’d want to put the variable in a present or a copy clause. Using a firstprivate is likely slightly better for performance, but either way should produce correct results.
Now if you were updating the global scalar on the device, then this would change the behavior since it can’t be privatized. It also creates a dependency and possible race condition, so you’d likely need to add an atomic directive on the assignment.
Another case which changes the default behavior of scalars is when the variable is passed by reference to a device subroutine. Since the subroutine would be defined elsewhere, the compiler wont know if it’s address is taken in the subroutine but it must assume it does since it could. In this scenario you do need to give the compiler more information by either putting the variable in a private/firstprivate clause or a copy/present clause depending on how the variable is being managed.
Arrays are presumed to be shared with the default behavior to create an implicit copy of the array. If the full array is in a visible data region, a declare or structured data region, then no implicit copy will occur. So when using a Fortran allocatable array or a C fixed sized array in a declare, then the compiler should be able to detect this and implicitly add it to a present clause. For a C pointer, only the pointer is visible, not the data it points to. Hence the compiler will still try to implicitly copy it. “present_or” semantics would apply, meaning at runtime it will detect that the array is present and not actually perform the copy. However in order to do an implicit copy, the compiler must know the size to copy which it can usually detect for the context of the loop. For example if you use “arr[i]” where “i” is the loop index variable, it knows it can use the loop trip count for the size. However if the index is computed, “arr[j*imax+i]”, it can’t determine the size so the user must put the array in a data clause (copy, copyin, copyout, or present).
All the implicit rules are really just there for the convivence of the user. Less typing and things to worry about. However, there’s no harm in always being explicit. Given it sounds like you’re developing a method to implicitly it might be easier for you to put all arrays in a data clause. You’d need to keep track of the size of the array for the triplet notation, which might be a challenge. You can add the “default(none)” clause which causes the compiler to not apply implicit data movement and instead requires users to put shared variables in a data clause. Compilation errors is any arrays are missing and might be a good check for you during testing.
In my case these globals both scalars as well as fixed size arrays are going to be read-only inside the acc parallel loop.
As here i have a control of how the generated code should be, i think it will be wise choice to mention scalars in firstprivate and fixed size arrays in present clause as you suggested.
I was doing one experiment where instead of placing #pragma acc update device just above the #pragma acc parallel loop for the globals,
I will call some function from main where those update directives will be placed
I have something like below code structured
main.cpp
#include <stdlib.h>
#include <stdio.h>
int imax, jmax;
#pragma acc declare create(jmax)
double pi = 2.0 * asin(1.0);
#pragma acc declare create(pi)
int mian()
{
update_imax();
update_jmax();
update_pi();
// Call to first function containing acc parallel loop
}
master_kernel.cpp
// global constants
extern int imax;
extern int jmax;
extern double pi;
update_imax() {
#pragma acc update device(imax)
}
//similar function for imax and pi
But for this
I am getting following error
FATAL ERROR: data in update device clause was not found on device 1: name=imax
Considering create is in global scope, this should work right?
if i replace the update_imax() call with #pragma acc update device(imax) it works fine
What’s likely happening is that the device hasn’t been initialized yet so imax doesn’t exist. The device initialization is delayed until the first OpenACC construct is reached.
Now I would have expected upon entry to the update directive that this would have triggered the device initialization, I’ll need to talk to engineering why this doesn’t occur in this case, but I can work around it by putting some OpenACC construct (in this case a serial region) before the call.