# different results for CPU and GPU

Something very strange happens when I do some parallel computing with OpenACC: the CPU and GPU results differ starting from the second iteration.
Here is the code:

``````
!$acc data copy(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5,
!$acc& ub,kx,ky,kz,gm1)
!$acc kernels loop present(u,v,w,lx,ly,lz,mx,my,mz,c,ub,kx,ky,kz,
!$acc& r1,r2,r3,r4,r5,gm1) private(m,vb,wb,q2,t3,t1,t2)
do 1000 m=1,n
vb    =  u(m)*lx(m)+v(m)*ly(m)+w(m)*lz(m)
wb    =  u(m)*mx(m)+v(m)*my(m)+w(m)*mz(m)
c
q2    =  0.5e0*(u(m)*u(m)+v(m)*v(m)+w(m)*w(m))
c
t3    =  1.e0/c(m)
t1    = -q2*r1(m)-r5(m)+u(m)*r2(m)+v(m)*r3(m)+w(m)*r4(m)
t1    =  gm1*t1*t3*t3
t2    = -ub(m)*r1(m)+kx(m)*r2(m)+ky(m)*r3(m)+kz(m)*r4(m)
t2    =  t2*t3
t3    = -vb*r1(m)+lx(m)*r2(m)+ly(m)*r3(m)+lz(m)*r4(m)
c
r3(m) = -wb*r1(m)+mx(m)*r2(m)+my(m)*r3(m)+mz(m)*r4(m)
r1(m) =  r1(m)+t1
r2(m) =  t3
r4(m) =  0.5e0*(t2-t1)
r5(m) =  r4(m)-t2
1000 continue
!$acc end parallel
!$acc end data
``````

“gm1” in the code is a common-block variable.
I have also tried `!$acc parallel loop` and `!$acc parallel loop seq`. They all give the same results as the code above. Besides, I compile the code with the -r8 option.
The compile information is as follows:

``````
with parallel loop
75, Generating copy(r5(:),ub(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),mz(:),r1(:),r2(:),r3(:),r4(:),u(:),c(:),kx(:),ky(:),w(:))
76, Generating present(r5(:),ub(:),w(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),u(:),c(:),kx(:),ky(:),mz(:),r1(:),r2(:),r3(:),r4(:))
78, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
78, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

with parallel loop seq
75, Generating copy(r5(:),ub(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),mz(:),r1(:),r2(:),r3(:),r4(:),u(:),c(:),kx(:),ky(:),w(:))
79, Generating present(r5(:),ub(:),w(:),v(:),kz(:),lx(:),ly(:),lz(:),mx(:),my(:),u(:),c(:),kx(:),ky(:),mz(:),r1(:),r2(:),r3(:),r4(:))
Accelerator kernel generated
Generating Tesla code
81, !$acc loop seq
``````

When I add acc clauses to another part of the code, like below, the CPU and GPU results are the same.

``````
do 10001 m=1,n

t1    =  1.e0/c(m)
rrho = 1.e0/rho(m)
xm2 = xm2a(m)*t1
xm2ar = 1.0/xm2a(m)
fplus = (eig2(m)-ub(m))*xm2ar
fmins = -(eig3(m)-ub(m))*xm2ar

r11 = r1(m)
r21 = r2(m)
r31 = r3(m)
r41 = r4(m)
r51 = r5(m)

vmag1 = u(m)**2 + v(m)**2 + w(m)**2
r5t = gm1*(0.5*vmag1*r11
.    - (u(m)*r21 + v(m)*r31 + w(m)*r41) + r51)
c
c ---- multiplication by inverse of precond. matrix
c
r1(m) = r11 - (1.-xm2)*r5t*t1*t1
r2(m) = rrho*(-u(m)*r11 + r21)
r3(m) = rrho*(-v(m)*r11 + r31)
r4(m) = rrho*(-w(m)*r11 + r41)
r5(m) = xm2*r5t
c
c ---- multiplication by T(inverse)
c
r5t =  r5(m)*t1*t1
r1(m) =  r1(m)-r5t
c
t2    =  rho(m)*r2(m)
t3    =  rho(m)*r3(m)
t4    =  rho(m)*r4(m)
c
r2(m) =           lx(m)*t2+ly(m)*t3+lz(m)*t4
r3(m) =           mx(m)*t2+my(m)*t3+mz(m)*t4
r4(m) =  0.5*(t1*(kx(m)*t2+ky(m)*t3+kz(m)*t4)
.       + r5t*fplus)
r5(m) =  -0.5*(t1*(kx(m)*t2+ky(m)*t3+kz(m)*t4)
.       - r5t*fmins)

10001 continue
``````

Hi xll_blt,

``````
I hava also tried !$acc parallel loop, !$acc parallel loop seq. They are all the same results with above code.
``````

Since it also gets incorrect results when run sequentially on the device, it’s most likely not a data race; it’s more likely a problem with synchronizing data between the device and the host.

Do you use another data region somewhere higher in the code? If so, the data region here will be ignored for all variables already present in the outer data region, so the “r” arrays aren’t copied out. Instead, you’ll want to use an “update” directive with a “self” clause to copy the data back to the host.
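As an illustration, nested data regions with an explicit update might look like the following sketch (the outer region and the exact array list are assumptions, not your actual code):

``````
c --- assumed outer data region, e.g. placed around the whole solver
!$acc data copy(r1,r2,r3,r4,r5)
...
c --- here the inner "copy" clause behaves as present_or_copy, so
c --- the arrays are NOT copied back when this inner region ends
!$acc data copy(r1,r2,r3,r4,r5)
!$acc kernels loop
      do 1000 m=1,n
...
 1000 continue
!$acc end data
c --- explicitly bring the device results back to the host
!$acc update self(r1,r2,r3,r4,r5)
...
!$acc end data
``````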

If this is the only data region in which these variables appear, then I’m not sure. There’s nothing obvious in the code you show that would cause this error. I’ll need a reproducing example.

A couple of notes unrelated to the problem:

• Since the parallel region is within a structured data region, you don’t need to use the “present” clause. It doesn’t hurt, but is just extraneous.
• Since scalars are private by default, you most likely don’t need the “private” clause either. The only time “private” is needed for scalars is when the scalar has a global reference (i.e. if the scalar is a module variable) or when the scalar is passed by reference (the default in Fortran) to a device subroutine.
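A minimal sketch of the second point (all names here are hypothetical, not taken from the code above):

``````
      module globals
      real :: t1               ! module variable: has a global reference
      end module globals

      subroutine demo(a,n)
      use globals
      integer :: n, m
      real :: a(n)
      real :: t2               ! local scalar: private by default
!$acc parallel loop private(t1)
      do m = 1, n
         t1 = 2.0*a(m)         ! needs "private" because t1 is global
         t2 = t1 + 1.0         ! no clause needed for t2
         a(m) = t2
      end do
      end subroutine demo
``````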

So you should be able to simplify the code to something like:

``````
!$acc data copy(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5,
!$acc& ub,kx,ky,kz,gm1)
!$acc parallel loop
``````

Finally, the “!$acc end parallel” directive isn’t needed either, since “parallel loop” defines the structured block to be the next loop. You only need to end the region if you don’t use “loop”. It doesn’t hurt to include the “end parallel”; it’s just extraneous.
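In other words, the two sketches below are equivalent; only the second needs an explicit end directive:

``````
c --- combined form: the next loop is the structured block
!$acc parallel loop
      do 1000 m=1,n
...
 1000 continue

c --- standalone region form: an explicit end is required
!$acc parallel
!$acc loop
      do 2000 m=1,n
...
 2000 continue
!$acc end parallel
``````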

-Mat

Hi Mat,
I did another try and made this the only data region for this part of the code, but I got the same incorrect results.
Since it is a small part (tinvr.F) of a big project (NASA CFL3D), I haven’t found a good way to reproduce it.

One thing we can try is the following, which makes sure all the arrays are synchronized before and after the compute region. We can also try the “serial” directive to ensure the code runs serially on the device. Though this is the same as saying “parallel loop seq”, so it may not matter (see below).

One question is how “wrong” the results are. If they are not too far off, one possibility is that the device uses FMA instructions by default, where the CPU may or may not use them depending on the architecture. Hence, try adding the flag “-Mnofma” to disable FMA.

``````
!$acc data create(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5,
!$acc& ub,kx,ky,kz,gm1)
!$acc update device(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5,
!$acc& ub,kx,ky,kz,gm1)
!$acc serial
do 1000 m=1,n
vb    =  u(m)*lx(m)+v(m)*ly(m)+w(m)*lz(m)
.. cut ..
r4(m) =  0.5e0*(t2-t1)
r5(m) =  r4(m)-t2
1000 continue
!$acc end serial
!$acc update self(u,v,w,lx,ly,lz,mx,my,mz,c,r1,r2,r3,r4,r5,
!$acc& ub,kx,ky,kz)
!$acc end data
``````

Hi, Mat
Great, it works with the “serial” directive. The “-Mnofma” flag did not help. Thank you very much.
By the way, what is the difference between the “serial” directive and “parallel loop seq”?

``````
By the way, what is the different between "serial" directives and "parallel loop seq"?
``````

They do use different planners, since “parallel” will analyze the remaining code to see if any additional loops can be parallelized, while “serial” simply creates the device code. So there could be some code-generation difference even though there are no inner loops to auto-parallelize.

I was able to check out CFL3D from git and I’m seeing if I can recreate the problem. I am attending a conference in Germany this week, so I have limited time. If I’m not able to determine the issue quickly, I’ll work on it when I get back to the office next week.

-Mat

Thanks, I put an OpenACC version of CFL3D on GitHub, so you can reproduce the results easily.
Here is the code: https://github.com/xllbit/CFL3D_ACC

I have solved the problem. The variable “gm1” is declared as a common variable in a module, so I have to update it on the device before use. When I added “!$acc update device(gm1)”, the result was correct.
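So the fix, as described, reduces to something like this sketch (the directive placement is illustrative):

``````
c --- gm1 lives in a common block, so its device copy is not
c --- refreshed automatically; push the host value before the kernel
!$acc update device(gm1)
!$acc kernels loop
      do 1000 m=1,n
...
         t1 = gm1*t1*t3*t3
...
 1000 continue
``````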

Excellent! I’m glad you found the issue.