reduction operation

Hi,

Do you have any ideas on how to create accelerator kernels when the program contains a reduction operation?
In a case like the one below, I couldn't make it work.

!$acc region
do i=1,n
   s = s + a(i)
end do
!$acc end region

%>pgfortran -ta=nvidia -Minfo test.f
27, No parallel kernels found, accelerator region ignored
28, Loop carried scalar dependence for s

Regards

ks-fujii

I don’t think the current accelerator directives support reductions. Your best bet might be to use CUDPP (which does have a reduction), though I’m not sure CUDPP has a Fortran interface.

Hi ks-fujii,

We should have support for reductions by the November release. In the meantime, I would suggest creating a summary array to hold the intermediate results and then performing the reduction on the host.

For example:

!$acc region
do i=1,n
   sarr(i) = a(i) * b(i) + c(i)
end do
!$acc end region

do i=1,n
   s = s + sarr(i)
end do

Hope this helps,
Mat

Hi Mat,
Could you please tell me how we can do a reduction with the current PGI Accelerator Model? I searched through the manual but haven’t found any description.
Also, does CUDA Fortran support any similar function to perform reductions?

Thanks,
Tuan

Hi Tuan,

The PGI Accelerator Model will automatically recognize reduction operations such as the one shown above and generate device code for them. Unless you have a very complex reduction, just write it in natural Fortran.

However, writing reductions in CUDA Fortran is a very complex task. Actually, writing them isn’t that hard; writing them so they perform well is. Take a look at my article on writing a Monte Carlo simulation (http://www.pgroup.com/lit/articles/insider/v2n1a4.htm). While I don’t go too deep into reductions, I do give a brief summary of how they work.
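To give a feel for what a performant GPU reduction involves, here is a rough sketch in Python (illustrative only, not CUDA Fortran; the function name and block size are made up). It mimics the in-block tree reduction: at each round, the first half of the active "threads" adds in the second half's values, halving the active count until one sum remains.

```python
def block_reduce(shared):
    """Simulate one thread block tree-reducing its shared-memory array.

    `shared` must have a power-of-two length. Each round, "thread" tid
    (with tid < stride) adds shared[tid + stride] onto shared[tid]; on
    a real GPU a __syncthreads() barrier separates the rounds.
    """
    stride = len(shared) // 2
    while stride > 0:
        for tid in range(stride):      # these run in parallel on a GPU
            shared[tid] += shared[tid + stride]
        stride //= 2                   # half as many active threads
    return shared[0]

print(block_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # -> 36
```

The point of the halving pattern is that the reduction finishes in log2(n) rounds instead of n serial additions; getting it to perform well on real hardware additionally requires attention to memory coalescing and avoiding idle threads, which is where the complexity Mat mentions comes in.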

  • Mat

Hi,
It’s really good that the PGI Accelerator Model now recognizes reductions and handles them automatically. But I would like to know how the reduction works internally in the Accelerator Model.
Thanks in advance.
Sandra

Hi Sandra,

Once the compiler recognizes a reduction, it generates an intermediate array to hold each thread’s partial reduction value. After the main kernel completes, a second, highly optimized kernel is launched to perform the actual reduction. If you’re interested, NVIDIA has posted a slide deck detailing how to create optimized reductions.

Hope this helps,
Mat

Hi Mat,
thanks for your response. So, did I understand right that you get synchronization by using a second kernel? And is the intermediate array then located in CPU memory? I.e., if I use our cluster batch system, do I have to reserve memory for this intermediate array as well?
Last question: for your second kernel, do you use the final optimized algorithm from the NVIDIA reduction slides?
Cheers, Sandra

So, did I understand right that you get synchronization by using a second kernel?

Yes.

Is the intermediate array then located in CPU memory?

No, it’s on the GPU.

So, for your second kernel, do you use the final optimized algorithm from the NVIDIA reduction slides?

I do believe that this is the standard algorithm for performing reductions.

  • Mat