I expected that the compiler would spot the reduction on the inner loop. Is there a better way to let it be detected or should I wait for a later compiler version where the reduction clause is supported?
The “kernels” model works on tightly nested loops. Here, the inner loop is not tightly nested, so it would not be accelerated, and adding the “reduction” clause would not help. (Note that in your code, as written, tmp isn’t a reduction anyway.)
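To illustrate what "tightly nested" means here, a minimal C sketch (the function names are hypothetical and not taken from your code). In the first function nothing sits between the two `for` headers; in the second, the outer loop body has statements before and after the inner loop, which is the shape of your code.

```c
#include <stddef.h>

/* Tightly nested: the inner loop is the only statement in the outer loop body. */
void scale_tight(size_t rows, size_t cols, double *a, double factor)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            a[i * cols + j] *= factor;
}

/* Not tightly nested: the outer body initializes tmp before the inner
   loop and stores it afterwards, so there is work between the loops. */
void row_sums_not_tight(size_t rows, size_t cols, const double *a, double *out)
{
    for (size_t i = 0; i < rows; i++) {
        double tmp = 0.0;
        for (size_t j = 0; j < cols; j++)
            tmp += a[i * cols + j];
        out[i] = tmp;
    }
}
```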
I think what you really want is a way to express that the outer loop is performed by a “gang” while the inner loop is performed by a “vector”. Something like:
#pragma acc data copyin(x_in,N), copy(x_out), create(k,n,tmp)
{
#pragma acc parallel
{
#pragma acc loop gang
for(k=0;k<N;k++){
// this section of code is performed by one thread in the gang
// if (thread == 1) {
tmp = 0;
// }
// call syncthreads() to synchronize the threads
#pragma acc loop vector(32)
for(n=0;n<N;n++){
// perform this loop in parallel across all threads in a gang,
// creating a partial sum
tmp = tmp + (x_in[n] * cos((-2*M_PI/N)*k*n));
}
// call syncthreads()
// back into sequential code
// if (thread == 1) {
// perform the final sum reduction of tmp and store the results back to memory
x_out[k] = tmp;
// }
// call syncthreads()
}
}
}
We just started to get requests like this in the last few months and are investigating how we can express this in OpenACC. We’re not sure if we can do this within the current “parallel” model specs, or if the OpenACC API needs to be extended. It’s very early.