Parallel Prefix Sum on the GPU (Scan)

elephant · April 28, 2011, 11:24am

Hello,

does anybody have a FORTRAN code example for an “up-sweep” of an array that is optimal for the PGI Accelerator Programming Model?

1, 2, 3, 4, 5, 6, 7, 8, 9, …
1, 1+2, 3, 3+4, 5, 5+6, 7, 7+8, 9,…
…

Thank you!

MatColgrove · April 28, 2011, 11:00pm

Hi elephant,

I don’t, but maybe someone else does.

Do you have an non-accelerator example source (that’s parallel) you could post? Granted, I’m not sure upsweep will work well on a GPU, but having something to start from would be helpful.

Mat

elephant · May 6, 2011, 9:40am

Hi,
The example source for an up-sweep would look something like this:

do d=0,int(log(dble(N))/log(2.0)-1)
   do i=0,N-1,(2**(d+1))
      T(i+2**(d+1)-1)=T(i+2**d-1)+T(i+2**(d+1)-1)
   end do
end do

Where T is the 1-dim array of size N which entries has to be summed up.

But since I only have to sum up every 8 neighboring entries (sum(1:8), sum(9:16),… ) and not the whole array, I rearanged N to another array Q of size (N/8,8) and wrote a code like the following and got very good speed-up!!!

!$acc region
   do i=1,N/8
      Qtemp1(i) = Q(i,1)+Q(i,2)+Q(i,3)+Q(i,4)     &
                 +Q(i,5)+Q(i,6)+Q(i,7)+Q(i,8)
   end do	  
!$acc end region

So, problem solved.
Thank you!