Parallelize without manual inlining

rztim · August 11, 2010, 4:16pm

Hello

I wonder if there is a way to accelerate a function without inlining the nested function calls. I know function calls are not allowed within a parallel compute region, but I think they are allowed within a data region. For example (pseudo code):

void foo(double* x, double* y){
    double *a, *b, *c;   /* n, M and the allocation of c are defined elsewhere */
    int i;
    a = (double*)malloc(n*sizeof(double));
    b = (double*)malloc(n*sizeof(double));

    parFoo1(x, y, c);
    for (i = 0; i < M; i++){
        parFoo2(a, x);
        parFoo2(b, x);
        parFoo1(a, b, y);
    }
}

Here parFoo* are functions with their own acc pragmas. The problem is that the data for arrays a, b and x is copied on every call in each iteration, although it only needs to reside on the device for the duration of the whole loop. So my idea was to define a data region like this in foo():

void foo(double* x, double* y){
#pragma acc data region copyin(x), copyout(y), local(a,b)
{
… the code …
}
}

Unfortunately, my approach did not work: the compiler still copies the arrays inside the accelerated functions parFoo*(). I get feedback like this:

parFoo1:
60, Generating copyin(x[0:n-1])
Generating compute capability 1.3 kernel
62, Loop is parallelizable
Accelerator kernel generated
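A hypothetical illustration of why this feedback appears: the thread does not show the bodies of parFoo1/parFoo2, so the kernel below (a[i] = 2*x[i]) is invented for the sketch, and the directive follows the PGI Accelerator `region` style used in this thread. Because the compute region sits inside the callee, and the caller's data region is not visible when the callee is compiled, the compiler has to generate transfers for the pointer arguments around this region, i.e. on every call.

```c
/* Hypothetical stand-in for parFoo2(a, x): the real kernel is not shown
   in the thread; this one simply computes a[i] = 2 * x[i]. */
void parFoo2(double *restrict a, const double *restrict x, int n)
{
    int i;
    /* The accelerator region lives inside the callee. The compiler cannot
       see a data region in the caller, so it generates transfers for x
       (in) and a (out) around this region, i.e. once per call. Compilers
       without accelerator support ignore the pragma and run the loop on
       the host. */
#pragma acc region
    {
        for (i = 0; i < n; i++)
            a[i] = 2.0 * x[i];
    }
}
```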

My question is whether there is a way to define a data region and avoid copying the needed arrays within the nested function calls. I guess the parFoo*() functions need to know that the parameter is already a “device pointer”. Thanks for your hints. Could I use the “local” clause to achieve this somehow?

Kind regards,

Tim

MatColgrove · August 11, 2010, 11:40pm

Hi Tim,

If this were Fortran, I’d suggest taking a look at the ‘reflected’ directive. While it won’t be available until the 11.0 release, ‘reflected’ allows you to do exactly this.

Unfortunately, you’re using C, where there isn’t a way to pass information about ‘x’ from one routine to another (specifically if it’s been malloc’d on the GPU). So you’d need to manually inline parFoo1 and parFoo2.
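A minimal sketch of what manual inlining looks like, assuming hypothetical kernel bodies (parFoo2 computes 2*x, parFoo1 adds its two inputs) since the real ones are not shown; foo_inlined, n and m are illustrative names, and the initial parFoo1(x,y,c) call is left out for brevity. The data region and every compute region now sit in the same function, so x is copied in once, y is copied out once, and a and b stay on the device for the whole loop. Clause spelling follows the PGI Accelerator style used in this thread (`[lower:upper]` bounds, as in the compiler feedback) and should be checked against the compiler documentation.

```c
#include <stdlib.h>

/* Manually inlined version of foo(). The kernel bodies are hypothetical
   stand-ins for parFoo2(a, x) and parFoo1(a, b, y). */
void foo_inlined(const double *restrict x, double *restrict y, int n, int m)
{
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    int it, i;

    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return;
    }

    /* One data region spans the whole outer loop: x is copied to the
       device once, y is copied back once, a and b exist only on the
       device. Host-only compilers ignore the pragmas. */
#pragma acc data region copyin(x[0:n-1]) copyout(y[0:n-1]) local(a[0:n-1], b[0:n-1])
    {
        for (it = 0; it < m; it++) {
#pragma acc region
            {
                /* inlined parFoo2(a, x) and parFoo2(b, x) */
                for (i = 0; i < n; i++) {
                    a[i] = 2.0 * x[i];
                    b[i] = 2.0 * x[i];
                }
                /* inlined parFoo1(a, b, y) */
                for (i = 0; i < n; i++)
                    y[i] = a[i] + b[i];
            }
        }
    }

    free(a);
    free(b);
}
```

On a host-only build the pragmas are ignored and the function computes y[i] = 4*x[i]. The trade-off is that the kernels can no longer be reused as separate functions.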

Adding ‘reflected’ to C is a long-term goal. However, given the limits of the language, it may be a while.

Sorry,
Mat

rztim · August 12, 2010, 10:19am

Hi Mat,

Thanks again for your help. From my point of view, the missing possibility of sharing data within a nested function call is a big limitation of the programming model. For larger codes it can be a lot of work to inline the functions, and it makes the code less readable, especially if you implement many different numerical techniques or algorithms that all use the same kernel functions (e.g. a matrix-vector product).

Would it be possible to use the ‘inline’ keyword to avoid the additional copy of the array? The compiler could then try to inline the function and recognize that the pointer passed as a parameter is already shared in the data region. The size of the array is specified in the data region pragma, so no additional information needs to be passed to the function.

Cheers,
Tim

MatColgrove · August 12, 2010, 10:35pm

From my point of view the missing possibility of sharing data within a nested function call is a big limitation of the programming model.

We understand and agree. All hope is not lost, since we can usually find solutions to difficult problems; this one is just particularly difficult given the confines of the language.

Would it be possible to use the ‘inline’ keyword to avoid the additional copy of the array?

It’s possible, and it’s something we’re investigating.

Mat

rztim · July 11, 2011, 4:20pm

OK, it’s an old thread, but I’m still trying this approach. As far as I understood, the reflected clause is implemented for C as of 11.4. But it does not work for me with code like this:

void foo(double* restrict x, double* restrict y){
#pragma acc reflected(x,y)
#pragma acc region for
for(...)
...do some work...

}

int main(){
double *a, *b;
a=(double*)malloc(n*sizeof(double));
b=(double*)malloc(n*sizeof(double));
... assign data...
#pragma acc data region copyin(a, b)
{
foo(a,b);
}
return 0;
}

Can I use this method to keep the data from being copied again in foo()? Should it work with 11.4? I get errors like this:

pgcc -fastsse -DDEBUG -ta=nvidia,cc20 -Minfo -g -c solver.c
PGC-S-0035-Syntax error: Recovery attempted by replacing identifier reflected by keyword cache (solver.c: 10)
PGC-S-0036-Syntax error: Recovery attempted by inserting <nl> before acc (solver.c: 11)
PGC-S-0037-Syntax error: Recovery attempted by deleting identifier region (solver.c: 11)
PGC-S-0036-Syntax error: Recovery attempted by inserting <nl> before keyword for (solver.c: 13)
PGC-W-0155-Long value is passed to a nonprototyped function - argument #3 (solver.c: 162)
PGC/x86-64 Linux 11.4-0: compilation completed with severe errors

It seems the reflected keyword is not recognized; otherwise, why does the compiler try to replace it with cache?

MatColgrove · July 11, 2011, 4:59pm

Hi Tim,

The reflected clause is not supported for C as of yet. It’s a new feature defined in the PGI Accelerator 1.3 spec, but not yet implemented. I’m sorry that you got the impression that it was added in 11.4.

Mat

rztim · July 12, 2011, 7:42am

Hi Mat,

Thanks again for the very fast answer. Is there a plan for which version will implement reflected for C? Is there a list of which parts of the PGI Accelerator 1.3 spec are implemented in which compiler release?

To avoid this manual function inlining, is it possible to use acc_malloc and define a data region with “deviceptr” somehow? How would this work?

Cheers,
Tim

MatColgrove · July 12, 2011, 5:34pm

Hi Tim,

We’ll start rolling out the 1.3 spec features in the PGI 2012 compilers. I’m not sure which will be available in November’s initial 12.0 release, but I know that reflected for C, as well as acc_malloc and deviceptr, are high on the priority list.

“acc_malloc” and “deviceptr” will allow you to create device pointers that can be accessed within or outside of data regions. Unfortunately, I don’t have an example at this time, but should be able to put one together once these features are added.

Mat
