Can __attribute__((always_inline)) inline be used with nvc++ and functions containing #acc loops?

Hi @all,
a common occurrence in programming is that you have to do several parallelizable tasks one after another.
Say you want to write a function declared as #pragma acc routine worker on the device which contains two sequential loops, like

#pragma acc routine worker
void foo(){
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do stuff
    }
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do other stuff
    }
}

Now, if these simple for loops are something more involved, say dot products or matrix multiplications, you certainly do not want to copy and paste an entire matrix multiplication into your function.

What you would like to do when you are working with data on the device is:

#pragma acc routine worker
void foo(){
    function1();
    function2();
}

Unfortunately, in OpenACC a worker cannot call another worker loop, and this is a problem if function1 and function2 contain worker loops.

However, we have this nice attribute:

__attribute__((always_inline)) inline

Does this solve the situation?

So can I write

__attribute__((always_inline)) inline void function1(){
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do stuff
    }
}

__attribute__((always_inline)) inline void function2(){
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do other stuff
    }
}

and then call these worker loops from

#pragma acc routine worker
void foo(){
    function1();
    function2();
}
and is this then equivalent to

#pragma acc routine worker
void foo(){
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do stuff
    }
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do other stuff
    }
}

or are the loops in function1 and function2 then treated as sequential by the compiler?

I have tried the above approach with my code, and it does not crash anymore, regardless of the optimization level.

However, I see no information in the compiler output about the parallelization of the loops within those functions.

Not even that they would be treated sequentially… nvc++ is simply silent about these inlined regions function1 and function2. The only hints that something correct was done are that I could not place a #pragma acc parallel loop directive inside them when the function was called from a worker (which is expected behavior, since only #pragma acc loop is allowed there), and that acc_on_device() returned true at runtime, despite function1 and function2 not being marked as OpenACC routines.

But I got no messages in the compiler output telling me how the loops in these inlined functions were actually compiled… whether sequentially or according to their directives.
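
For reference, here is a minimal sketch of the setup and check I am describing (the file name is arbitrary; acc_on_device() comes from <openacc.h>, and -acc together with -Minfo=accel is what I use to request offloading and per-loop feedback from nvc++):

#include <cstddef>
#include <openacc.h>

// Toy helper mirroring the code above; the hope is that inlining makes the
// #pragma acc loop inside it part of foo's worker region.
__attribute__((always_inline)) inline void function1(){
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do stuff
    }
}

#pragma acc routine worker
void foo(){
    // acc_on_device() returns nonzero when this code runs on a non-host device
    if(acc_on_device(acc_device_not_host)){
        function1();
    }
}

// Compile with per-loop feedback:
//   nvc++ -acc -Minfo=accel file.cpp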

(I want to note that I find it strange that a worker cannot call a worker. Of course one should not be allowed to call a worker function from within a worker loop in a worker function. But not everything in a worker function is a loop; sometimes it is just one large matrix multiplication after another, sequentially)…

I hope someone can confirm whether

__attribute__((always_inline)) inline

really can save the day in this situation…

Apparently not. I now saw the line in the compiler output where it says it ignores the parallelism. So one basically has to copy the entire code in. That is, if one does 200 different matrix multiplications (not in a loop, but with different arguments), executed one after another in, say, a worker routine, one would have to copy the entire worker loops for the multiplications, with each different argument, into the routine, 200 times, because calls to functions containing worker loops are not allowed…

This does not seem to make sense for projects that do complicated computations…

I guess that means I have to wait until OpenMP is more stable, or use raw CUDA…

According to the standard, by the way, worker functions can of course call routines which have a worker loop in them:

The worker clause specifies that the procedure contains, may contain, or may call another procedure that contains a loop with a worker clause,

So I really do not know why I have to make the matrix multiplications that are called in my worker function seq in order to prevent it from crashing with nvc++…
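
For illustration, this is the kind of seq workaround I mean, sketched with a toy matrix multiplication (the name and signature are made up): the routine contains no worker loop, so calling it from a worker routine is accepted, but its loops then run sequentially within each worker.

// Toy sketch of the "seq" workaround.
#pragma acc routine seq
inline void matmul_seq(const double* A, const double* B, double* C, int n){
    for(int i = 0; i < n; i++)
        for(int j = 0; j < n; j++){
            double sum = 0.0;
            for(int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}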

Let's see what happens if I inline them via preprocessor macros…
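
Roughly along these lines (a sketch only; the macro name and the dot-product body are just an example, and _Pragma is the standard way to place a directive inside a macro):

#include <cstddef>

// Hypothetical macro that pastes a worker loop directly into the caller,
// so no call to a function containing a worker loop is needed.
#define DOT_WORKER(x, y, n, result)                         \
    {                                                       \
        double sum_ = 0.0;                                  \
        _Pragma("acc loop worker reduction(+:sum_)")        \
        for(size_t i_ = 0; i_ < (n); i_++)                  \
            sum_ += (x)[i_] * (y)[i_];                      \
        (result) = sum_;                                    \
    }

#pragma acc routine worker
void foo(const double* x, const double* y, size_t n, double* r){
    DOT_WORKER(x, y, n, r[0]);   // expands to a worker loop inside foo
    DOT_WORKER(y, x, n, r[1]);
}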

Apologies; to clarify my earlier post, from the standard:

…or gang-partitioned mode. For instance, a procedure with a routine worker directive may be called from within a loop that has the gang clause, but not from within a loop that has the worker clause. Only one of the gang, worker, vector, and seq clauses may appear for each device type.

My assumption is that your worker routines contained worker loops, and code inside a worker loop can’t call worker routines. “foo” itself is in worker-single mode, so it should be OK.

Do function1 and function2 contain calls to vector routines? If not, you may want to consider using vector instead of worker here, unless the vector length is 1.
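
For example, something along these lines (a sketch reusing the toy loops from above; adjust to your actual code):

#include <cstddef>

// Helpers parallelized over vector lanes; a vector routine may be called
// from worker-single code such as the body of foo.
#pragma acc routine vector
inline void function1(){
    #pragma acc loop vector
    for(size_t i = 1; i < 100; i++){
        // do stuff
    }
}

#pragma acc routine vector
inline void function2(){
    #pragma acc loop vector
    for(size_t i = 1; i < 100; i++){
        // do other stuff
    }
}

#pragma acc routine worker
void foo(){
    function1();
    function2();
}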

In our implementation when targeting NVIDIA devices, “gang” maps to a CUDA block, “worker” to the y dimension of the block and “vector” to the x dimension.

The default launch configuration is 32x4, i.e. vector_length(32) and num_workers(4), for 128 threads total. So in the case of a worker loop, only 4 threads will be executing. Best practice is to use gang and vector, and only use worker if you need the third dimension.
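
As an illustration (a sketch only; N, M, K and the loop body are placeholders, and the clause values just spell out the 32x4 default mentioned above):

// gang -> CUDA thread block, worker -> threadIdx.y, vector -> threadIdx.x
#pragma acc parallel loop gang num_workers(4) vector_length(32)
for(int i = 0; i < N; i++){
    #pragma acc loop worker
    for(int j = 0; j < M; j++){
        #pragma acc loop vector
        for(int k = 0; k < K; k++){
            // work on element (i, j, k)
        }
    }
}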

Note, “collapse” is a useful clause when you have tightly nested loops since it expands the number of loops that can be parallelized.
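
For example (sketch; N, M and the body are placeholders):

// collapse(2) merges the two tightly nested loops into a single iteration
// space of N*M, so both loops are parallelized together.
#pragma acc parallel loop gang vector collapse(2)
for(int i = 0; i < N; i++){
    for(int j = 0; j < M; j++){
        // work on element (i, j)
    }
}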

Also, the “always_inline” attribute isn’t really needed since the “inline” keyword is sufficient for these routines to be auto-inlined.
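
That is, something like this should be enough (sketch; -Minfo=inline can additionally be used to see what actually got inlined):

#include <cstddef>

inline void function1(){
    #pragma acc loop
    for(size_t i = 1; i < 100; i++){
        // do stuff
    }
}

// nvc++ -acc -Minfo=accel,inline file.cpp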