I don’t think the current accelerator model supports reductions. Your best bet might be using CUDPP (which does have a reduction), though I’m not sure CUDPP has a Fortran interface.
We should have support for reductions by the November release. In the meantime, I would suggest creating a summary array to hold the intermediate results and then performing the reduction on the host.
For example:
!$acc region
do i=1,n
   sarr(i) = a(i) * b(i) + c(i)
end do
!$acc end region
! reduce the partial results on the host
s = 0.0
do i=1,n
   s = s + sarr(i)
end do
Hi Mat,
Could you please tell me how we can do a reduction with the current Accelerator Programming Model? I searched through the manual but haven’t found any description of it.
Also, does CUDA Fortran provide any similar function to perform reductions?
The PGI Accelerator Model will automatically recognize and create device code for reduction operations such as the one shown above. Unless you have a very complex reduction operation, just write it in natural Fortran.
However, writing reductions in CUDA Fortran is a very complex task. Actually, writing them isn’t that hard; writing them so they perform well is. Take a look at my article on writing a Monte Carlo simulation (http://www.pgroup.com/lit/articles/insider/v2n1a4.htm). While I don’t go too in-depth into reductions, I do give a brief summary of how they work.
Hi,
it’s really good that the PGI Accelerator Model now recognizes reductions and manages them. But I would like to know how the reduction works internally in the Accelerator model.
Thanks in advance.
Sandra
Once the compiler recognizes a reduction, it generates an intermediate array to hold the reduction values for each thread. After the main kernel completes, a second, highly optimized kernel is launched to perform the actual reduction. If you’re interested, NVIDIA has posted a slide deck detailing how to create optimized reductions HERE.
Hi Mat,
thanks for your response. So did I understand it right, that you get synchronization by using a second kernel? And is the intermediate array then located in CPU memory? I.e., if I use our cluster batch system, do I have to reserve memory for this intermediate array as well?
Last question: for your second kernel, do you use the last (most optimized) algorithm from the NVIDIA reduction slides?
Cheers, Sandra