OpenACC reductions

nickaj · March 26, 2012, 4:29pm

Hi,

I was trying to implement a 1D FFT using the OpenACC directive support in 12.3.

Here’s the core of my code:

#pragma acc data copyin(x_in,N), copy(x_out), create(k,n,tmp)
  {
#pragma acc kernels
#pragma acc loop gang(32), vector(16)
    for(k=0;k<N;k++){
#pragma acc loop
      for(n=0;n<N;n++){
        tmp = x_in[n] * cos((-2*M_PI/N)*k*n);
      }
      x_out[k] = tmp;
    }
  }

I expected that the compiler would spot the reduction on the inner loop. Is there a better way to let it be detected or should I wait for a later compiler version where the reduction clause is supported?

MatColgrove · March 26, 2012, 9:25pm

Hi nickaj,

The “kernels” model works on tightly nested loops. Here, the inner loop is not tightly nested, so it would not be accelerated, and using the “reduction” clause would not help. (Note that tmp isn’t a reduction in your code anyway: each inner iteration overwrites tmp instead of accumulating into it.)

I think what you really want is a way to express that the outer loop is performed by a “gang” while the inner loop is performed by a “vector”. Something like:

#pragma acc data copyin(x_in,N), copy(x_out), create(k,n,tmp)
  {
#pragma acc parallel
    {
#pragma acc loop gang
      for(k=0;k<N;k++){

        // this section of code is performed by one thread in the gang
        // if (thread == 1) {
        tmp = 0;
        // }
        // syncthreads() to synchronize the threads

#pragma acc loop vector(32)
        for(n=0;n<N;n++){
          // perform this loop in parallel across all threads in a gang,
          // creating a partial sum
          tmp = tmp + (x_in[n] * cos((-2*M_PI/N)*k*n));
        }
        // syncthreads()
        // back into sequential code
        // if (thread == 1) {
        // perform the final sum reduction of tmp and store the result back to memory
        x_out[k] = tmp;
        // }
        // syncthreads()
      }
    }
  }

We just started to get requests like this in the last few months and are investigating how we can express this in OpenACC. We’re not sure if we can do this within the current “parallel” model specs, or if the OpenACC API needs to be extended. It’s very early.

Mat
