different results for CPU and GPU

It is very strange: when I do some parallel computing with OpenACC, the CPU and GPU results differ starting from the second iteration.
Here is the code:

!$acc data copy(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5,
!$acc& ub,kx,ky,kz,gm1)        
!$acc kernels loop present(u,v,w,lx,ly,lz,mx,my,mz,c,ub,kx,ky,kz,
!$acc& r1,r2,r3,r4,r5,gm1) private(m,vb,wb,q2,t3,t1,t2)         
      do 1000 m=1,n
      vb    =  u(m)*lx(m)+v(m)*ly(m)+w(m)*lz(m)
      wb    =  u(m)*mx(m)+v(m)*my(m)+w(m)*mz(m)
c
      q2    =  0.5e0*(u(m)*u(m)+v(m)*v(m)+w(m)*w(m))
c
      t3    =  1.e0/c(m)
      t1    = -q2*r1(m)-r5(m)+u(m)*r2(m)+v(m)*r3(m)+w(m)*r4(m)
      t1    =  gm1*t1*t3*t3
      t2    = -ub(m)*r1(m)+kx(m)*r2(m)+ky(m)*r3(m)+kz(m)*r4(m)
      t2    =  t2*t3
      t3    = -vb*r1(m)+lx(m)*r2(m)+ly(m)*r3(m)+lz(m)*r4(m)
c
      r3(m) = -wb*r1(m)+mx(m)*r2(m)+my(m)*r3(m)+mz(m)*r4(m)
      r1(m) =  r1(m)+t1
      r2(m) =  t3
      r4(m) =  0.5e0*(t2-t1)
      r5(m) =  r4(m)-t2
 1000 continue
!$acc end kernels loop
!$acc end data

“gm1” in the code is a variable from a common block.
I have also tried !$acc parallel loop and !$acc parallel loop seq; they all give the same results as the code above. Besides, I compile the code with the -r8 option.
The compile information is as follows:

with parallel loop
     75, Generating copy(r5(:),ub(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),mz(:),r1(:),r2(:),r3(:),r4(:),u(:),c(:),kx(:),ky(:),w(:))
     76, Generating present(r5(:),ub(:),w(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),u(:),c(:),kx(:),ky(:),mz(:),r1(:),r2(:),r3(:),r4(:))
     78, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         78, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

with parallel loop seq
     75, Generating copy(r5(:),ub(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),mz(:),r1(:),r2(:),r3(:),r4(:),u(:),c(:),kx(:),ky(:),w(:))
     79, Generating present(r5(:),ub(:),w(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),u(:),c(:),kx(:),ky(:),mz(:),r1(:),r2(:),r3(:),r4(:))
         Accelerator kernel generated
         Generating Tesla code
         81, !$acc loop seq

When I add similar acc directives to another part of the code, like the loop below, the results for CPU and GPU are the same.

do 10001 m=1,n

      t1    =  1.e0/c(m)
      rrho = 1.e0/rho(m)
      xm2 = xm2a(m)*t1
      xm2ar = 1.0/xm2a(m)
      fplus = (eig2(m)-ub(m))*xm2ar
      fmins = -(eig3(m)-ub(m))*xm2ar
 
      r11 = r1(m)
      r21 = r2(m)
      r31 = r3(m)
      r41 = r4(m)
      r51 = r5(m)

      vmag1 = u(m)**2 + v(m)**2 + w(m)**2
      r5t = gm1*(0.5*vmag1*r11 
     .    - (u(m)*r21 + v(m)*r31 + w(m)*r41) + r51) 
c
c ---- multiplication by inverse of precond. matrix
c
      r1(m) = r11 - (1.-xm2)*r5t*t1*t1 
      r2(m) = rrho*(-u(m)*r11 + r21)
      r3(m) = rrho*(-v(m)*r11 + r31)
      r4(m) = rrho*(-w(m)*r11 + r41)
      r5(m) = xm2*r5t
c
c ---- multiplication by T(inverse)
c
      r5t =  r5(m)*t1*t1
      r1(m) =  r1(m)-r5t
c
      t2    =  rho(m)*r2(m)
      t3    =  rho(m)*r3(m)
      t4    =  rho(m)*r4(m)
c
      r2(m) =           lx(m)*t2+ly(m)*t3+lz(m)*t4
      r3(m) =           mx(m)*t2+my(m)*t3+mz(m)*t4
      r4(m) =  0.5*(t1*(kx(m)*t2+ky(m)*t3+kz(m)*t4)
     .       + r5t*fplus)
      r5(m) =  -0.5*(t1*(kx(m)*t2+ky(m)*t3+kz(m)*t4)
     .       - r5t*fmins)

10001 continue

Hi xll_blt,

I have also tried !$acc parallel loop and !$acc parallel loop seq. They all give the same results as the code above.

Since you also get incorrect results when the loop is run sequentially on the device, it's most likely not a data race, but rather a problem with synchronizing the device and host data.

Do you use another data region somewhere higher up in the code? If so, the data region here will be ignored for every variable that is already in the outer data region, so the “r” arrays aren't copied out. Instead, you'll want to use an “update” directive with a “self” clause to copy the data back to the host.
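For example, a minimal sketch of that approach for this loop (assuming the arrays really are already present from an enclosing data region, so the local data directive is dropped):

c arrays assumed present from an outer data region, so no local
c copy clause here; the update pulls the results back to the host
!$acc parallel loop
      do 1000 m=1,n
      vb    =  u(m)*lx(m)+v(m)*ly(m)+w(m)*lz(m)
c     ... rest of the loop body exactly as above ...
 1000 continue
!$acc update self(r1,r2,r3,r4,r5)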

If this is the only data region in which these variables appear, then I’m not sure. There’s nothing obvious in the code you show that would cause this error. I’ll need a reproducing example.

A couple of notes unrelated to the problem:

  • Since the parallel region is within a structured data region, you don’t need to use the “present” clause. It doesn’t hurt; it’s just extraneous.
  • Since scalars are private by default, you most likely don’t need the “private” clause either. The only time “private” is needed for a scalar is when the scalar has a global reference (i.e. it is a module variable) or when it is passed by reference (the default in Fortran) to a device subroutine; a small sketch of that case follows this list.
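To illustrate that last point, here is a small, hypothetical sketch (the module and variable names are made up) of the one scalar case that still needs an explicit “private”:

      module scratch_mod
c     a module-level scalar has a global reference, so it is shared
      real tmp
      end module scratch_mod

      subroutine demo(r, c, n)
      use scratch_mod
      integer n, m
      real r(n), c(n)
c without private(tmp) every thread would update the same shared
c module variable, creating a race
!$acc parallel loop private(tmp)
      do m = 1, n
         tmp  = 2.0e0*c(m)
         r(m) = r(m)*tmp
      end do
      end subroutine demo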

So you should be able to simplify the code to something like:

!$acc data copy(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5, 
!$acc& ub,kx,ky,kz,gm1)        
!$acc parallel loop

Finally, the explicit end directive (“!$acc end kernels loop” in your code, or “!$acc end parallel” with “parallel loop”) isn’t needed either, since the combined directive defines the structured block to be the next loop. You only need an end directive when you don’t use “loop”. It doesn’t hurt to include one; it’s just extraneous.
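As a trivial sketch (the loop body is just a placeholder), these two spellings of the same region are equivalent:

      subroutine add_one(r1, n)
      integer n, m
      real r1(n)
c combined form: the construct ends right after the loop, so no
c explicit end directive is required
!$acc parallel loop
      do m = 1, n
         r1(m) = r1(m) + 1.0e0
      end do
c split form: here an explicit end directive closes the region
!$acc parallel
!$acc loop
      do m = 1, n
         r1(m) = r1(m) + 1.0e0
      end do
!$acc end parallel
      end subroutine add_one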

-Mat

Hi Mat,
Thanks for your detailed reply.
I did another try with this part of the code as the only data region, but it got the same incorrect results.
Since it is a small part (tinvr.F) of a big project (NASA CFL3D), I haven’t found a good way to produce a standalone reproducer.

One thing we can try is the following, which will make sure all the arrays are synchronized before and after the compute region. We can also try the “serial” directive to ensure the code is run serially on the device. Though this is the same as saying “parallel loop seq”, so it may not matter (see below).

One question is: how “wrong” are the results? If they are not too far off, one possibility is that the device uses FMA instructions by default, while the CPU may or may not use them depending on the architecture. Hence, try adding the flag “-Mnofma” to disable FMA.


!$acc data create(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5, 
!$acc& ub,kx,ky,kz,gm1)        
!$acc update device(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5, 
!$acc& ub,kx,ky,kz,gm1)  
!$acc serial      
      do 1000 m=1,n 
      vb    =  u(m)*lx(m)+v(m)*ly(m)+w(m)*lz(m) 
.. cut ..
      r4(m) =  0.5e0*(t2-t1) 
      r5(m) =  r4(m)-t2 
 1000 continue 
!$acc end serial
!$acc update self(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5,
!$acc& ub,kx,ky,kz)  
!$acc end data

Hi, Mat
Great, it works with the “serial” directive. The “-Mnofma” flag did not help. Thank you very much.
By the way, what is the difference between the “serial” directive and “parallel loop seq”?

By the way, what is the difference between the "serial" directive and "parallel loop seq"?

They do use different planners, since “parallel” will analyze the remaining code to see if any additional loops can be parallelized, while “serial” simply generates the device code. So there could be some code-generation differences even though there are no inner loops to auto-parallelize.
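Roughly, and only as a sketch of the two spellings being compared (the loop body is a placeholder):

      subroutine seq_demo(r1, n)
      integer n, m
      real r1(n)
c serial: the region runs on the device with a single gang, worker,
c and vector lane, and nothing inside is considered for parallelism
!$acc serial
      do m = 1, n
         r1(m) = r1(m) + 1.0e0
      end do
!$acc end serial
c parallel loop seq: the outer loop is forced to run sequentially,
c but the compiler may still look for other loops inside the region
c to parallelize, so code generation can differ
!$acc parallel loop seq
      do m = 1, n
         r1(m) = r1(m) + 2.0e0
      end do
      end subroutine seq_demo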

I was able to check out CFL3D from git and am seeing if I can recreate the problem. I am attending a conference in Germany this week, so I have limited time. If I’m not able to determine the issue quickly, I’ll work on it when I get back to the office next week.

-Mat

Thanks, I put an OpenACC version of CFL3D on GitHub so you can reproduce the results easily.
Here is the code: https://github.com/xllbit/CFL3D_ACC

I have solved the problem. The variable “gm1” is declared as a common-block variable in a module, so I need to update it on the device before using it. When I added “!$acc update device(gm1)”, the result was correct.
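For reference, a minimal sketch of where that update fits; this assumes gm1 is already device-resident (for example through an acc declare on its common block or an enclosing data region), so only its value needs refreshing before the kernel:

c the device copy of the common-block scalar gm1 is stale, so
c refresh it from the host before launching the compute region
!$acc update device(gm1)
!$acc parallel loop
      do 1000 m=1,n
      vb    =  u(m)*lx(m)+v(m)*ly(m)+w(m)*lz(m)
c     ... rest of the loop body exactly as in the first post ...
 1000 continue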

Excellent! I’m glad you found the issue.