Confused newbie could do with a little help

Hi,

I’m new to accelerator programming and a little bit confused by some issues I am having. I am trying to accelerate a nested loop and am finding some - to me - odd behaviour.

The problem code is:


icorr = 0.00
jcorr = 0.00

! ----------------------------------------------------------------------

!$acc region copyin(density,pressure,mass,ZONE_ID,NPIZPL,ZPLIST)
!$acc do parallel, independent

do i = 1,num_particles

count = 0
itemp = 0.00

!$acc do parallel, independent, private(j), reduction(+:jcorr)

do p_j = 1,NPIZPL(ZONE_ID(1,i),ZONE_ID(2,i),ZONE_ID(3,i))

j = ZPLIST(ZONE_ID(1,i),ZONE_ID(2,i),ZONE_ID(3,i),p_j)

! ----------------------------------------------------------------------

pr_i = pressure(i) / (density(i) * density(i))
pr_j = pressure(j) / (density(j) * density(j))

p_v = pr_i + pr_j

! ----------------------------------------------------------------------

count = count + 1

iindx(count) = i
itemp(1,count) = (mass(j) * p_v)

jcorr(1,j) = jcorr(1,j) + (mass(i) * p_v)


! ----------------------------------------------------------------------

enddo ! j

! -------------------------------------------------------------
! uncommenting this obviously changes the value of icorr
! but it also changes the value of jcorr…

do p = 1,count
icorr(1,iindx(p)) = icorr(1,iindx(p)) + itemp(1,p)
enddo
! -------------------------------------------------------------

enddo ! i
!$acc end region


do p = 1,num_particles
print*,p,jcorr(1,p),icorr(1,p)
enddo


The issue I have is that uncommenting the p loop alters the value of jcorr, even though jcorr is not set in that loop. I assume this is down to a more general oversight on my part. Any hints or tips on how I could improve this would be very welcome.

Hi TTilford,

The core problem with this code is that the “jcorr” and “icorr” arrays aren’t independent within the “i” loop. Conceptually, every iteration of the “i” loop is running at the same time, so they are all updating the same “jcorr(1,j)”. By adding “independent” you’ve told the compiler to parallelize the loop anyway. If you remove “independent” and add the compiler flag “-Minfo=accel”, the compiler will correctly flag this dependency.

The reduction clause is only valid for scalars, hence “reduction(+:jcorr)” is not allowed. While we are investigating allowing array reductions as an extension to the OpenACC API, it wouldn’t help you here since the dependency is with the outer loop. Actually, in the inner “p_j” loop “jcorr” doesn’t have a dependency, so it doesn’t need the reduction.
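For reference, a scalar reduction looks like this (a minimal, self-contained sketch; “total” and “a” are hypothetical names, not from your code):

 program red_example
 real*8 a(100), total
 integer i, n
 n = 100
 a = 1.0
 total = 0.0
!$acc kernels loop reduction(+:total)
 do i = 1,n
    ! each iteration's contribution is combined into one scalar
    total = total + a(i)
 enddo
 print*, total
 end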

Your solutions to this problem are to either parallelize only the inner loops or add another dimension to “icorr” and “jcorr” to correspond to the “i” dimension.

Actually, scratch that. The inner “p_j” loop isn’t parallelizable either. While you can put “count” in a reduction, you can’t use the intermediary values. The problem is that in order to update the “count” element of “iindx”, all the previous values of count need to be computed before a given loop iteration knows its “count”.

Though, why use “iindx” at all? Aren’t all of its values “i”? Removing “iindx” and then manually privatizing “itemp” (i.e. adding an extra dimension corresponding to the “i” index) or putting it in a “private” clause might work around the dependency. After further review, I’m thinking that the “itemp” array may not even be necessary.

Finally, I’d move to using OpenACC syntax rather than PGI Accelerator model syntax. OpenACC is an open standard supported by multiple compilers, so it will make your code more portable.
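For example, the directives in your snippet map roughly like this (a sketch with hypothetical “a”, “b” and “n”; clause spellings differ slightly between the two models):

! PGI Accelerator model:
!$acc region
!$acc do parallel, independent
 do i = 1,n
    b(i) = 2.0*a(i)
 enddo
!$acc end region

! OpenACC equivalent:
!$acc kernels
!$acc loop independent
 do i = 1,n
    b(i) = 2.0*a(i)
 enddo
!$acc end kernels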

Here’s my best guess at how to get this code to accelerate, though without a complete example I don’t know whether it will work in your larger application. I’m also assuming that the computed “j” values are all distinct.


!$acc data create(icorr2,jcorr2) &
!$acc      copyin(density,pressure,mass,ZONE_ID,NPIZPL,ZPLIST) &
!$acc      copyout(icorr(1,1:num_particles),jcorr(1,1:num_particles))

!$acc kernels
 icorr2 = 0.00
 jcorr2 = 0.00
!$acc end kernels

!$acc kernels loop gang worker independent
 do i = 1,num_particles
 !$acc loop vector independent
   do p_j = 1,NPIZPL(ZONE_ID(1,i),ZONE_ID(2,i),ZONE_ID(3,i))

     j = ZPLIST(ZONE_ID(1,i),ZONE_ID(2,i),ZONE_ID(3,i),p_j)
! ----------------------------------------------------------------------
     pr_i = pressure(i) / (density(i) * density(i))
     pr_j = pressure(j) / (density(j) * density(j))
     p_v = pr_i + pr_j
     icorr2(j,i) = icorr2(j,i) + (mass(j) * p_v) ! indexed (j,i) so the sum over the first index below gives icorr
     jcorr2(i,j) = jcorr2(i,j) + (mass(i) * p_v)
! ----------------------------------------------------------------------
   enddo ! j
 enddo ! i
 
!$acc kernels loop
 do p = 1,num_particles
   tempJ = 0.0
   tempI = 0.0
!$acc loop reduction(+:tempJ,tempI)
   do i = 1,num_particles
      tempJ = tempJ + jcorr2(i,p)
      tempI = tempI + icorr2(i,p)
   enddo
   icorr(1,p) = tempI
   jcorr(1,p) = tempJ
 enddo
!$acc end data

 do p = 1,num_particles
   print*,p,jcorr(1,p),icorr(1,p)
 enddo
  • Mat

Mat,

Thanks ever so much for this!

The iindx and itemp arrays were really just me exploring possibilities for getting the code to work. I was trying to parallelise both the i and j loops because the p_j loop will have a maximum of about 64 elements and I have 1536 cores on my GPU card.

I did actually intend for multiple i loop threads to update the same jcorr value - is this not feasible?

The code calculates interactions between a number of particles in a domain. Each particle interacts with every other particle. If I have 10 particles then I loop i = 1,10 and j = 1,10 (with the i not equal j test removed from my code for simplicity) to consider each particle combination, and calculate a correction value (icorr and jcorr) for each particle. I assume that this is quite a common scientific computing task - are you aware of any general guidance on how to deal with this type of problem?

My code is a little more complex: rather than looking at every interaction, it only looks at interactions between particles within a certain distance of each other. So i could be a few million and p_j a few tens, which makes parallelising the p_j loop less effective than parallelising the i loop.

Tim.

I have come up with a quick test code to assess GPU acceleration of multi-body problems.

The code assesses the gravitational attraction forces of four balls placed on a flat surface. I have written the code as two routines - one is a simple n by n approach, while the second uses the symmetrical nature of the inter-ball forces to come up with the answer doing half the work.

Could you suggest an approach to accelerate the code? I will see what I can come up with myself.




program accel_test

call test_1()

call test_2()

end


! ----------------------------------------------------------------------
!
! test sub n by n and no acceleration
!
! ----------------------------------------------------------------------
subroutine test_1()

use openacc

real*8 r
real*8 ij_force

real*8 position(1:2,1:4)
real*8 mass(1:4)
real*8 grav_constant

real*8 force(1:2,1:4)

integer n,i,j,b


! ----------------------------------------------------------------------
! define geometry - 4 balls on a surface

n = 4

! ball 1 at x = 0, y = 0

position(1,1) = 0.0
position(2,1) = 0.0

! ball 2 at x = 2, y = 0

position(1,2) = 2.0
position(2,2) = 0.0

! ball 3 at x = 2, y = 1

position(1,3) = 2.0
position(2,3) = 1.0

! ball 4 at x = 0, y = 1

position(1,4) = 0.0
position(2,4) = 1.0


! ----------------------------------------------------------------------
! mass of each ball = 1000 Kg
! gravitational constant = 6.674e-11

mass = 1000.0

grav_constant = 6.674e-11


! initialise force = 0

force = 0.0


! ----------------------------------------------------------------------

print*,' '
print*,' n * n non accelerated '
print*,' '


! ----------------------------------------------------------------------

do i = 1,n
do j = 1,n

if(i.ne.j) then

dx = abs(position(1,i)-position(1,j))
dy = abs(position(2,i)-position(2,j))

r = sqrt((dx*dx) + (dy*dy))

ij_force = (grav_constant * mass(i) * mass(j)) / (r*r)

force(1,i) = force(1,i) + (ij_force * (dx/r))
force(2,i) = force(2,i) + (ij_force * (dy/r))

endif ! i ne j

enddo ! j
enddo ! i


! ----------------------------------------------------------------------

print*,' '
do b = 1,n
print*,b,force(1:2,b)
enddo
print*,' '


! ----------------------------------------------------------------------


return

end


! ----------------------------------------------------------------------
!
! test sub symmetric and no acceleration
!
! ----------------------------------------------------------------------
subroutine test_2()

use openacc

real*8 r
real*8 ij_force

real*8 position(1:2,1:4)
real*8 mass(1:4)
real*8 grav_constant

real*8 force(1:2,1:4)

integer n,i,j,b


! ----------------------------------------------------------------------
! define geometry - 4 balls on a surface

n = 4

! ball 1 at x = 0, y = 0

position(1,1) = 0.0
position(2,1) = 0.0

! ball 2 at x = 2, y = 0

position(1,2) = 2.0
position(2,2) = 0.0

! ball 3 at x = 2, y = 1

position(1,3) = 2.0
position(2,3) = 1.0

! ball 4 at x = 0, y = 1

position(1,4) = 0.0
position(2,4) = 1.0


! ----------------------------------------------------------------------
! mass of each ball = 1000 Kg
! gravitational constant = 6.674e-11

mass = 1000.0

grav_constant = 6.674e-11


! initialise force = 0

force = 0.0


! ----------------------------------------------------------------------

print*,' '
print*,' symmetric non accelerated '
print*,' '


! ----------------------------------------------------------------------

do i = 1,n
do j = 1,n

if(j.gt.i) then

dx = abs(position(1,i)-position(1,j))
dy = abs(position(2,i)-position(2,j))

r = sqrt((dx*dx) + (dy*dy))

ij_force = (grav_constant * mass(i) * mass(j)) / (r*r)

force(1,i) = force(1,i) + (ij_force * (dx/r))
force(2,i) = force(2,i) + (ij_force * (dy/r))

force(1,j) = force(1,j) + (ij_force * (dx/r))
force(2,j) = force(2,j) + (ij_force * (dy/r))

endif ! j gt i

enddo ! j
enddo ! i


! ----------------------------------------------------------------------

print*,' '
do b = 1,n
print*,b,force(1:2,b)
enddo
print*,' '


! ----------------------------------------------------------------------


return

end

Hi Tim,

I did actually intend for multiple i loop threads to update the same jcorr value - is this not feasible?

It is feasible, but it requires you to use atomic operations, which can be quite slow.

I assume that this is quite a common scientific computing task - are you aware of any general guidance on how to deal with this type of problem?

Sounds like an N-body simulation, which is quite common. There are many CUDA implementations and a few OpenACC ones.

While written in C++, you might find this article interesting: http://devblogs.nvidia.com/parallelforall/accelerating-cfd-code-openacc/

There’s also the SPEC ACCEL 350.md benchmark (https://www.spec.org/accel/Docs/350.md.html), which uses a 2-body screened Coulomb interaction. I can’t send you the code, but you’re welcome to contact the folks at Indiana University noted in the link.

My code is a little more complex as rather than looking at every interaction it only looks at interactions between particles within a certain distance of each other so i could be a few million, p_j a few tens so parallelising the p_j loop is less effective than parallelising the i loop.

Given this, you might want to not parallelize the inner “p_j” loop and just do a reduction, then use atomic updates for the icorr and jcorr arrays.
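Something along these lines (a minimal sketch using the arrays from your first post; “itmp” is a scalar I’ve introduced, and since the inner loop runs sequentially within each thread, the per-“i” sum is just a private scalar rather than a reduction clause; I’m also assuming “icorr” and “jcorr” are zeroed beforehand):

!$acc parallel loop gang vector
 do i = 1,num_particles
    itmp = 0.0
!$acc loop seq
    do p_j = 1,NPIZPL(ZONE_ID(1,i),ZONE_ID(2,i),ZONE_ID(3,i))
       j = ZPLIST(ZONE_ID(1,i),ZONE_ID(2,i),ZONE_ID(3,i),p_j)
       p_v = (pressure(i)/(density(i)*density(i))) + &
             (pressure(j)/(density(j)*density(j)))
       ! private accumulation for this thread's own particle
       itmp = itmp + (mass(j) * p_v)
       ! scattered update - needs an atomic since many "i" threads
       ! can hit the same "j"
!$acc atomic update
       jcorr(1,j) = jcorr(1,j) + (mass(i) * p_v)
    enddo
    icorr(1,i) = icorr(1,i) + itmp
 enddo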

For your example code, I ported two versions. The first uses reductions on the inner loop. For the symmetric test, this meant having to split the computation into two separate loops to compute the “force([1,2],i)” then the “force([1,2],j)” contributions, which may or may not be any better than the non-symmetric version. Note that the “non-symmetric” version used here is similar in structure to what the Indiana folks used.

The second uses the OpenACC “atomic” directive. The caveat here is that we currently only support basic “atomic” operations. We expect to support all OpenACC atomic operations in the 14.7 release, but I would expect it to take a few releases after that to iron out any problems.

These are by no means the only methods; you may also want to combine the two ideas.

I should also note that given the small problem size, the performance relative to the host will be very poor, though you should see a speed-up once the size is scaled up.

Example #1 using reductions:

% cat nbody_red.f90
program accel_test
 call test_1()
 call test_2()
 end
! test sub n by n - reduction version
!----------------------------------------------------------------------
 subroutine test_1()
 use openacc
 real*8 r
 real*8 ij_force
 real*8 position(1:2,1:4)
 real*8 mass(1:4)
 real*8 grav_constant
 real*8 force(1:2,1:4)
 real*8 tmp1, tmp2
 integer n,i,j,b
 !
 !----------------------------------------------------------------------
 ! define geometry - 4 balls on a surface
 n = 4
 ! ball 1 at x = 0, y = 0
 position(1,1) = 0.0
 position(2,1) = 0.0
 ! ball 2 at x = 2, y = 0
 position(1,2) = 2.0
 position(2,2) = 0.0
 ! ball 3 at x = 2, y = 1
 position(1,3) = 2.0
 position(2,3) = 1.0
 ! ball 4 at x = 0, y = 1
 position(1,4) = 0.0
 position(2,4) = 1.0

 !
 !----------------------------------------------------------------------
 ! mass of each ball = 1000 Kg
 ! gravitational constant = 6.674e-11
 mass = 1000.0
 grav_constant = 6.674e-11
 ! initialise force = 0
 force = 0.0
 !
 !----------------------------------------------------------------------
 print*,' '
 print*,' n * n accelerated - reductions '
 print*,' '
 !----------------------------------------------------------------------

!$acc kernels loop
 do i = 1,n
    tmp1 = force(1,i)
    tmp2 = force(2,i)
!$acc loop reduction (+:tmp1,tmp2)
    do j = 1,n
      if(i.ne.j) then
         dx = abs(position(1,i)-position(1,j))
         dy = abs(position(2,i)-position(2,j))
         r = sqrt((dx*dx) + (dy*dy))
         ij_force = (grav_constant * mass(i) * mass(j)) / (r*r)
         tmp1 = tmp1 + (ij_force * (dx/r))
         tmp2 = tmp2 + (ij_force * (dy/r))
      endif ! i ne j
    enddo ! j
    force(1,i) = tmp1
    force(2,i) = tmp2
 enddo ! i

 !
 !----------------------------------------------------------------------

 print*,' '
 do b = 1,n
 print*,b,force(1:2,b)
 enddo
 print*,' '

 return
 end


 !
 ! test sub symmetric - reduction version
 !
 !----------------------------------------------------------------------
 subroutine test_2()

 use openacc

 real*8 r
 real*8 ij_force

 real*8 position(1:2,1:4)
 real*8 mass(1:4)
 real*8 grav_constant

 real*8 force(1:2,1:4)
 real*8 tmp,tmp1,tmp2
 integer n,i,j,b

 !----------------------------------------------------------------------
 ! define geometry - 4 balls on a surface

 n = 4

 ! ball 1 at x = 0, y = 0

 position(1,1) = 0.0
 position(2,1) = 0.0

 ! ball 2 at x = 2, y = 0

 position(1,2) = 2.0
 position(2,2) = 0.0

 ! ball 3 at x = 2, y = 1

 position(1,3) = 2.0
 position(2,3) = 1.0

 ! ball 4 at x = 0, y = 1

 position(1,4) = 0.0
 position(2,4) = 1.0


 !
 !----------------------------------------------------------------------
 ! mass of each ball = 1000 Kg
 ! gravitational constant = 6.674e-11

 mass = 1000.0

 grav_constant = 6.674e-11

 ! initialise force = 0
 force = 0.0
 !----------------------------------------------------------------------
 print*,' '
 print*,' symmetric accelerated - reductions '
 print*,' '
 !----------------------------------------------------------------------

!$acc kernels loop
 do i = 1,n
    tmp1 = 0.0
    tmp2 = 0.0
!$acc loop independent reduction(+:tmp1,tmp2)
    do j = 1,n
      if(j.gt.i) then
        dx = abs(position(1,i)-position(1,j))
        dy = abs(position(2,i)-position(2,j))
        r = sqrt((dx*dx) + (dy*dy))
        ij_force = (grav_constant * mass(i) * mass(j)) / (r*r)
        tmp1  = tmp1 + (ij_force * (dx/r))
        tmp2  = tmp2 + (ij_force * (dy/r))
      endif ! j gt i
    enddo ! j
    force(1,i) = force(1,i) + tmp1
    force(2,i) = force(2,i) + tmp2
 enddo !i
 !----------------------------------------------------------------------
!$acc kernels loop
do j = 1,n
    tmp1 = 0.0
    tmp2 = 0.0
!$acc loop independent reduction(+:tmp1,tmp2)
  do i = 1,n
    if(j.gt.i) then
        dx = abs(position(1,i)-position(1,j))
        dy = abs(position(2,i)-position(2,j))
        r = sqrt((dx*dx) + (dy*dy))
        ij_force = (grav_constant * mass(i) * mass(j)) / (r*r)
        tmp1 = tmp1 + (ij_force * (dx/r))
        tmp2 = tmp2 + (ij_force * (dy/r))
      endif ! j gt i
    enddo ! i
    force(1,j) = force(1,j) + tmp1
    force(2,j) = force(2,j) + tmp2
 enddo !j

 !----------------------------------------------------------------------
 print*,' '
 do b = 1,n
 print*,b,force(1:2,b)
 enddo
 print*,' '
 !----------------------------------------------------------------------
 return
 end

Example #2 using atomics. Note: compile either for just the device (i.e. -ta=tesla) or add “-lcudadevice” or “-Mcuda” to get the host versions of the atomic routines. By default we target both the NVIDIA GPU and the host (i.e. -ta=tesla,host).
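For example (the file name here is hypothetical):

% pgfortran -acc -ta=tesla -Minfo=accel nbody_atomic.f90 -o nbody_atomic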

program accel_test
 call test_1()
 call test_2()
 end
! test sub n by n - atomic version
!----------------------------------------------------------------------
 subroutine test_1()
 use openacc
 real*8 r
 real*8 ij_force
 real*8 position(1:2,1:4)
 real*8 mass(1:4)
 real*8 grav_constant
 real*8 force(1:2,1:4)
 real*8 tmp
 integer n,i,j,b
 !
 !----------------------------------------------------------------------
 ! define geometry - 4 balls on a surface
 n = 4
 ! ball 1 at x = 0, y = 0
 position(1,1) = 0.0
 position(2,1) = 0.0
 ! ball 2 at x = 2, y = 0
 position(1,2) = 2.0
 position(2,2) = 0.0
 ! ball 3 at x = 2, y = 1
 position(1,3) = 2.0
 position(2,3) = 1.0
 ! ball 4 at x = 0, y = 1
 position(1,4) = 0.0
 position(2,4) = 1.0

 !
 !----------------------------------------------------------------------
 ! mass of each ball = 1000 Kg
 ! gravitational constant = 6.674e-11
 mass = 1000.0
 grav_constant = 6.674e-11
 ! initialise force = 0
 force = 0.0
 !
 !----------------------------------------------------------------------
 print*,' '
 print*,' n * n accelerated - atomics '
 print*,' '
 !----------------------------------------------------------------------

!$acc kernels loop independent copy(force)
 do i = 1,n
!$acc loop independent
    do j = 1,n
      if(i.ne.j) then
         dx = abs(position(1,i)-position(1,j))
         dy = abs(position(2,i)-position(2,j))
         r = sqrt((dx*dx) + (dy*dy))
         ij_force = (grav_constant * mass(i) * mass(j)) / (r*r)
!$acc atomic
         force(1,i) = force(1,i) + (ij_force * (dx/r))
!$acc atomic
         force(2,i) = force(2,i) + (ij_force * (dy/r))
      endif ! i ne j
    enddo ! j
 enddo ! i

 !
 !----------------------------------------------------------------------
 print*,' '
 do b = 1,n
 print*,b,force(1:2,b)
 enddo
 print*,' '

 return
 end


 !
 ! test sub symmetric - atomic version
 !
 !----------------------------------------------------------------------
 subroutine test_2()

 use openacc

 real*8 r
 real*8 ij_force

 real*8 position(1:2,1:4)
 real*8 mass(1:4)
 real*8 grav_constant

 real*8 force(1:2,1:4)

 integer n,i,j,b

 !----------------------------------------------------------------------
 ! define geometry - 4 balls on a surface

 n = 4

 ! ball 1 at x = 0, y = 0

 position(1,1) = 0.0
 position(2,1) = 0.0

 ! ball 2 at x = 2, y = 0

 position(1,2) = 2.0
 position(2,2) = 0.0

 ! ball 3 at x = 2, y = 1

 position(1,3) = 2.0
 position(2,3) = 1.0

 ! ball 4 at x = 0, y = 1

 position(1,4) = 0.0
 position(2,4) = 1.0


 !
 !----------------------------------------------------------------------
 ! mass of each ball = 1000 Kg
 ! gravitational constant = 6.674e-11

 mass = 1000.0

 grav_constant = 6.674e-11

 ! initialise force = 0
 force = 0.0
 !----------------------------------------------------------------------
 print*,' '
 print*,' symmetric accelerated - atomics '
 print*,' '
 !----------------------------------------------------------------------

!$acc kernels loop independent copy(force)
 do i = 1,n
!$acc loop independent
   do j = 1,n
     if(j.gt.i) then
       dx = abs(position(1,i)-position(1,j))
       dy = abs(position(2,i)-position(2,j))
       r = sqrt((dx*dx) + (dy*dy))
       ij_force = (grav_constant * mass(i) * mass(j)) / (r*r)
!$acc atomic
       force(1,i) = force(1,i) + (ij_force * (dx/r))
!$acc atomic
       force(2,i) = force(2,i) + (ij_force * (dy/r))
!$acc atomic
       force(1,j) = force(1,j) + (ij_force * (dx/r))
!$acc atomic
       force(2,j) = force(2,j) + (ij_force * (dy/r))
     endif ! j gt i
    enddo ! j
 enddo ! i
 !----------------------------------------------------------------------
 print*,' '
 do b = 1,n
 print*,b,force(1:2,b)
 enddo
 print*,' '
 !----------------------------------------------------------------------
 return
 end

Mat - again, thank you ever so much for your assistance with this.

I hadn’t considered the need for atomic operations but this really makes sense.

I am having some issues getting your code to compile - I am getting a message PGF90-W-0287 unrecognized ACC Directive - Atomic. I have tried the flags you list to no avail. I am using pgfortran 14.3 and will check that this is the most up-to-date version I can use.

Thanks again.

Tim.

Hi Tim,

The first release with basic atomic support was 14.4; 14.7 (due out mid-July) is expected to have complete support.

  • Mat

Hi,

Looks like I can get 14.6 now - have asked my IT team to install this then I should be good to go.

Thanks for your help!

Tim.

Hi Mat,

Sorry for the follow up on this. Apparently, I don’t have access to version 14.6 despite only recently buying the compiler.

If I have a parallel - non-independent - loop with no atomic statement, I appear to get the correct results. As far as I understand things, the atomic statement only allows one thread to write to a memory location at a time.

I assume the code I have functions by luck because I don’t get two threads writing to the same place at the same time?

Could you let me know what happens when I do get this behaviour - will there be a crash or error statement - or will one of the operations fail to be performed correctly?

Essentially, is the lack of an atomic statement something I can turn a blind eye to until I can get access to the version with this capability?

Thanks in advance

Tim.

I assume the code I have functions by luck because I don’t get two threads writing to the same place at the same time?

Could you let me know what happens when I do get this behaviour - will there be a crash or error statement - or will one of the operations fail to be performed correctly?

I would expect your code to get wrong answers, but race conditions can be non-deterministic depending on the order in which the threads run. Whether it’s “luck” or the race condition doesn’t actually occur is something you’d need to investigate.

Essentially, is the lack of an atomic statement something I can turn a blind eye to until I can get access to the version with this capability?

Ideally you’d remove the potential race condition and only use atomics if the algorithm requires them. Atomics have a negative performance impact, so even if a version that removes the race conditions does more work, it may be faster than the atomic version. No guarantee, but it’s something to investigate.

I hesitate to advise turning a blind eye, but so long as you’re aware that you have a race condition and a verification problem, it should be OK. Granted, if you start getting verification issues it will be hard to be sure whether they’re coming from the new code or from the race condition.

  • Mat

Hi Mat,

Thanks for your reply. I guess I will need to rewrite my code to try to get around this.

Tim.

TPR 20573 - OpenACC fortran nbody tests fails to compile with atomics due to “unsupported operation”
is fixed in the current 14.7 release.

thanks,
dave