I don’t think the current accelerator model supports reductions. Your best bet might be using CUDPP (which does have a reduction), though I’m not sure CUDPP has a Fortran interface.
We should have support for reductions by the November release. In the meantime, I would suggest creating a summary array to hold the intermediate results and then performing the reduction on the host.
For example:
!$acc region
do i=1,n
   sarr(i) = a(i) * b(i) + c(i)
end do
!$acc end region
! reduce the partial results on the host
s = 0.0
do i=1,n
   s = s + sarr(i)
end do
Hi Mat,
Could you please tell me how we can do a reduction with the current Accelerator Programming Model? I searched through the manual but haven’t found any description of it.
Also, does CUDA Fortran provide any similar function to perform reductions?
The PGI Accelerator Model will automatically recognize and create device code for reduction operations such as the one shown above. Unless you have a very complex reduction operation, just write it in natural Fortran.
However, writing reductions in CUDA Fortran is a very complex task. Actually, writing them isn’t that hard; writing them so they perform well is. Take a look at my article on writing a Monte Carlo simulation (http://www.pgroup.com/lit/articles/insider/v2n1a4.htm). While I don’t go too in-depth into reductions, I do give a brief summary of how they work.
Hi,
it’s really good that the PGI Accelerator Model now recognizes reductions and manages them. But I would like to know how the reduction works internally in the Accelerator model.
Thanks in advance.
Sandra
Once the compiler recognizes a reduction, it generates an intermediate array to hold the reduction values for each thread. After the main kernel completes, a second, highly optimized kernel is launched to perform the actual reduction. If you’re interested, NVIDIA has posted a slide deck detailing how to create optimized reductions HERE.
Hi Mat,
thanks for your response. So did I understand it right, that you get synchronization by using a second kernel? And is the intermediate array then located in CPU memory? I.e., if I use our cluster batch system, do I have to reserve memory for this intermediate array as well?
Last question: for your second kernel, do you use the last (most optimized) algorithm from the NVIDIA reduction slides?
Cheers, Sandra