Parallelize without manual inlining

Hello

I wonder if there is a possibility of accelerating a function without inlining the nested function calls. I know function calls within a parallel compute region are not allowed, but I think they are allowed within a data region. For example (pseudocode):

void foo(double* x, double* y){
double *a, *b, *c;
a=(double*)malloc(n*sizeof(double));
b=(double*)malloc(n*sizeof(double));

parFoo1(x,y,c);
for (i = 0; i<M; i++){
parFoo2(a, x);
parFoo2(b, x);
parFoo1(a,b,y);
}
}

Here parFoo* are functions with corresponding acc pragmas. The problem is that the data for the arrays a, b, and x is copied for every call in each iteration, although the data only needs to be on the device for the duration of the loop. So my idea was to define a data region like this in foo():

void foo(){
#pragma acc data region copyin(x), copyout(y), local(a,b)
{
… the code …
}
}

Unfortunately my approach did not work, because the compiler still copies the arrays within the accelerated functions parFoo*(). I get feedback like this:

parFoo1:
60, Generating copyin(x[0:n-1])
Generating compute capability 1.3 kernel
62, Loop is parallelizable
Accelerator kernel generated

My question is whether it is possible to define a data region and avoid copying the needed array within the nested function call. I guess the parFoo*() function needs to know that the parameter is already a “device pointer”. Thanks for your hints. Could I use the “local” clause to realize this somehow?

Kind regards,

Tim

Hi Tim,

If this were Fortran, I’d say to take a look at the ‘reflected’ directive. While it won’t be available until the 11.0 release, ‘reflected’ allows you to do exactly this.

Unfortunately, you’re using C, where there isn’t a way to pass information about ‘x’ from one routine to another (specifically, if it’s been malloc’d on the GPU). So you’d need to manually inline parFoo1 and parFoo2.

Adding ‘reflected’ to C is a long-term goal. However, given the limits of the language, it may be a while.

Sorry,
Mat
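
A minimal sketch of what the manual inlining described above could look like for the loop in foo(). The kernel bodies are hypothetical stand-ins (the thread never shows parFoo1 or parFoo2): here parFoo2 is assumed to scale its input by 2 and parFoo1 to add two arrays. The names foo_inlined, n, and m are also assumptions. The directives follow the PGI Accelerator spelling used elsewhere in this thread (data region, region for, local, and the [lower:upper] range form seen in the compiler feedback). A compiler without accelerator support ignores the pragmas and runs the loops on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Manually inlined version of foo(): every compute region sits lexically
 * inside one data region, so x, y, a and b stay on the device for all m
 * iterations instead of being copied at each call.
 * Returns 0 on success, -1 if host allocation fails. */
int foo_inlined(const double *restrict x, double *restrict y, int n, int m)
{
    assert(n > 0 && m > 0);

    double *a = malloc((size_t)n * sizeof(double));
    double *b = malloc((size_t)n * sizeof(double));
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1;
    }

    /* x is only read, y is only written, a and b are device scratch arrays. */
    #pragma acc data region copyin(x[0:n-1]) copyout(y[0:n-1]) local(a[0:n-1], b[0:n-1])
    {
        for (int it = 0; it < m; it++) {
            /* inlined parFoo2(a, x): hypothetical body, a = 2*x */
            #pragma acc region for
            for (int i = 0; i < n; i++)
                a[i] = 2.0 * x[i];

            /* inlined parFoo2(b, x): hypothetical body, b = 2*x */
            #pragma acc region for
            for (int i = 0; i < n; i++)
                b[i] = 2.0 * x[i];

            /* inlined parFoo1(a, b, y): hypothetical body, y = a + b */
            #pragma acc region for
            for (int i = 0; i < n; i++)
                y[i] = a[i] + b[i];
        }
    }

    free(a);
    free(b);
    return 0;
}
```

The cost is the one raised in this thread: each kernel body is repeated at every call site, so a shared routine such as a matrix-vector product has to be duplicated wherever it is used.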

Hi Mat,

Thanks again for your help. From my point of view, the missing possibility of sharing data within a nested function call is a big limitation of the programming model. For bigger codes it might be a lot of work to inline the functions, and it makes the code less readable, especially if you try to implement a lot of different numerical techniques or algorithms that all use the same kernel functions (e.g. a matrix-vector product).

Would it not be possible to use the ‘inline’ keyword to avoid the additional copy of the array? The compiler could then try to inline the function and realize that the pointer passed as a parameter is already shared in the data region. The size of the array is specified in the data region pragma, so no additional information needs to be passed to the function.

Cheers,
Tim

From my point of view, the missing possibility of sharing data within a nested function call is a big limitation of the programming model.

We understand and agree. All hope is not lost, since we can usually find solutions to difficult problems; this one is just particularly difficult given the confines of the language.

Would it not be possible to use the ‘inline’ keyword to avoid the additional copy of the array?

It’s possible, and it’s something we’re investigating.

- Mat

OK. It’s an old thread, but I am still trying this method. As far as I understood, the reflected clause is implemented for C as of 11.4. But for me it does not work for code like this:

void foo(double* restrict x, double* restrict y){
#pragma acc reflected(x,y)
#pragma acc region for
for(...)
...do some work...

}

void main(){
a=(double*)malloc(n*sizeof(double));
b=(double*)malloc(n*sizeof(double));
... assign data...
#pragma acc data region copyin(a, b)
{
foo(a,b);
}
}

Can I avoid the data being copied again in foo() with this method? Should it work with 11.4? I get an error like this:

pgcc -fastsse -DDEBUG -ta=nvidia,cc20 -Minfo -g -c solver.c
PGC-S-0035-Syntax error: Recovery attempted by replacing identifier reflected by keyword cache (solver.c: 10)
PGC-S-0036-Syntax error: Recovery attempted by inserting <nl> before acc (solver.c: 11)
PGC-S-0037-Syntax error: Recovery attempted by deleting identifier region (solver.c: 11)
PGC-S-0036-Syntax error: Recovery attempted by inserting <nl> before keyword for (solver.c: 13)
PGC-W-0155-Long value is passed to a nonprototyped function - argument #3 (solver.c: 162)
PGC/x86-64 Linux 11.4-0: compilation completed with severe errors

It sounds like the reflected keyword is not known. Or why else does the compiler try to replace it with cache?

Hi Tim,

The reflected clause is not supported for C as of yet. It’s a new feature defined in the PGI Accelerator 1.3 spec, but not yet implemented. I’m sorry that you got the impression that it was added in 11.4.

- Mat

Hi Mat,

Thanks for the very fast answer again. Is there a plan for which version will implement reflected for C? Is there a list of which parts of the PGI Accelerator 1.3 spec are implemented in which compiler version?

To avoid this manual function inlining, is it possible to use acc_malloc and define a data region with “deviceptr” somehow? How would this work?

Cheers,
Tim

Hi Tim,

We’ll start rolling out the 1.3 Spec features in the PGI 2012 compilers. I’m not sure which will be available in November’s initial 12.0 release, but I know adding reflected for C as well as acc_malloc and deviceptr are high on the priority list.

“acc_malloc” and “deviceptr” will allow you to create device pointers that can be accessed within or outside of data regions. Unfortunately, I don’t have an example at this time, but should be able to put one together once these features are added.

- Mat
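
Since no example was available in the thread, here is a rough sketch of how acc_malloc and deviceptr fit together. It uses the OpenACC spelling of these features (acc_malloc, acc_free, and a deviceptr clause on a parallel loop); whether the PGI Accelerator 1.3 implementation matches it exactly is an assumption. The helper names device_alloc, device_free, fill, sum, and demo are hypothetical. When the compiler does not define _OPENACC, the helpers fall back to host malloc and the pragmas are ignored, so the code still runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Allocate n doubles on the device (host fallback without OpenACC). */
static double *device_alloc(size_t n)
{
#ifdef _OPENACC
    return (double *)acc_malloc(n * sizeof(double));
#else
    return (double *)malloc(n * sizeof(double));
#endif
}

static void device_free(double *p)
{
#ifdef _OPENACC
    acc_free(p);
#else
    free(p);
#endif
}

/* Callee: deviceptr(a) declares that a already points to device memory,
 * so no copy is generated for it at this call. */
void fill(double *a, int n, double value)
{
    #pragma acc parallel loop deviceptr(a)
    for (int i = 0; i < n; i++)
        a[i] = value;
}

/* Second callee reusing the same device pointer; only the scalar
 * reduction result travels back to the host. */
double sum(const double *a, int n)
{
    double s = 0.0;
    #pragma acc parallel loop deviceptr(a) reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Caller: the device pointer is passed through ordinary function calls. */
double demo(int n)
{
    assert(n > 0);
    double *a = device_alloc((size_t)n);
    if (a == NULL)
        return -1.0;
    fill(a, n, 1.5);
    double s = sum(a, n);
    device_free(a);
    return s;
}
```

Because the pointer itself carries the "already on the device" property, nested calls such as fill() and sum() need no inlining. The trade-off is that the host must not dereference a when it was obtained from acc_malloc.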