Questions on incorrect results with OpenACC on GPU

It’s not unusual to see slight differences. Are they within an acceptable tolerance?

There’s quite a bit that can cause numerical differences in floating-point results, but my guess here is that it’s due to order of operations and rounding error. In particular, operations such as reductions or atomics change the order in which the individual operations are applied. Because of the small rounding inherent in floating-point arithmetic, changing the order gives slightly different results. With a relatively small number of threads the difference may not be noticeable, but going massively parallel may show more differences.

Other things that can cause differences are math intrinsics (sin, cos, etc.), FMA operations, the data type’s precision (float vs. double), the data itself (very large or very small numbers may not be accurately representable), and the compiler optimizations being applied, among others.

Some of these we can adjust (such as the optimization level or whether FMA is used), but differences due to the parallel order of operations can’t be helped.
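
For illustration, here’s a minimal sketch (hypothetical code, not from the application in question) of how reassociating a sum across threads changes the result in the last bits:

      program fp_order
      implicit none
      integer, parameter :: n = 1000000
      real :: a(n), s_serial, s_par
      integer :: i
      call random_number(a)
      s_serial = 0.0
      do i = 1, n
         s_serial = s_serial + a(i)   ! one fixed order of additions
      end do
      s_par = 0.0
!$acc parallel loop reduction(+:s_par) copyin(a)
      do i = 1, n
         s_par = s_par + a(i)         ! partial sums combined in a different order
      end do
      ! the two sums typically differ in the last few bits
      print *, s_serial, s_par, abs(s_serial - s_par)
      end program fp_order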

Hi Mat, I’ve just used some of your comments to modify my code, because I’m also getting different (unconverged) results when running on the GPU, compared with my (converged) CPU runs. I used the -Minfo=accel option to get some idea of what might need adding.

The really interesting part (which I wasn’t expecting) is that when I added some variables to the data copy statement, the code ran at least 10 times faster on the GPU (compared with multicore on the CPU). I don’t know why yet - it was taking about the same time before on GPU and CPU multicore.

Here is the -Minfo=accel output. I’m not sure where to start…?

flow_solve:
2032, Generating copy(m(:),nei(:,:),oc(:),wigg(:),w(:),t(:),u(:),v(:),oleaf(:),p(:),r(:)) [if not already present]
2048, Generating copy(resid(:)) [if not already present]
Generating implicit copy(maxwig) [if not already present]
Generating NVIDIA GPU code
2056, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Generating reduction(max:maxwig)
Generating reduction(+:resid(:))
Generating reduction(max:maxvel)
2071, !$acc loop seq
2162, !$acc loop seq
2558, !$acc loop seq
2048, Local memory used for pn,e1,e2,f1,f2,g1,resid,tn,un,vn,wn,mn,rn,g2,q
Generating implicit copy(maxvel) [if not already present]
2056, Generating implicit private(rsum,avg1,nb5,nb4,nb3,nb2,nb1,mwig,msum,zwig,ywig,xwig,mag2,dt,dotp,v2,qzz,tzz,tzy,tzx,tyz,txz,et,mm,magvel,pp,owid,tt,tsum,ww,wsum,vv,vsum,uu,usum,rr,qyy,tyy,tyx,txy,qxx,psum,txx,dtdz,dwdz,dvdz,dudz,dtdy,dwdy,dvdy,dudy,dtdx,dwdx,dvdx,dudx,dtloc,tweak,mag1,wnew,vnew,vmag,unew,diff1,ccln,wnew1,vnew1,unew1,wnew2,vnew2,unew2,diff2,nb6,nrm2,nrm1,nmag,wnrm,vnrm,unrm,wold,vold,uold,srinv,avg2,olfn,nrm3,olev)
2071, Loop is parallelizable
Generating implicit private(nb)
2162, Scalar last value needed after loop for cs at line 2554
Generating implicit private(nb8,nb7,nb6,nb5,nb4,nb3,nb2,nb1,nb,cs,ma)
Scalar last value needed after loop for cs at line 2554
2302, Reference argument passing prevents parallelization:
2558, Loop is parallelizable
Generating implicit private(rhs,gflux,fflux,eflux)

Regards, Giles.

It might have been some extra data movement. For example, if the offloaded loop at line 2056 is inside another loop (like a timestep loop), then hoisting the data copy before that outer loop will help. Otherwise, the arrays get copied on each iteration of the timestep loop.
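
Roughly like this (nsteps, ncells, and the update statement are just placeholders for your code):

!$acc data copy(u,v,w)
      do it = 1, nsteps
!$acc parallel loop present(u,v,w)
         do n = 1, ncells
            u(n) = u(n) + dt*(v(n) - w(n))   ! placeholder update of the arrays
         end do
      end do
!$acc end data

The arrays now move to the device once before the timestep loop and come back once after it, instead of once per iteration.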

Is it something to do with private vs firstprivate, and values accidentally being re-initialised?

“private” variables aren’t initialized while “firstprivate” variables are initialized using the value from the host.

By default with the “parallel” construct, scalar variables are “firstprivate”. If any scalar should be shared instead, then you’d need to explicitly add them to a “copy” clause.
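
As a small illustration (err, tol, and nfail are made-up names): a read-only scalar such as tol just takes the default firstprivate behavior, while listing nfail in copy() gives one shared device copy whose final value is visible back on the host:

      nfail = 0
!$acc parallel loop copy(nfail)
      do i = 1, n
         if (abs(err(i)) > tol) then   ! tol: firstprivate, initialized from the host value
!$acc atomic update
            nfail = nfail + 1          ! nfail: shared, so the update needs an atomic
         end if
      end do
      print *, nfail                   ! the final count has been copied back to the host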

because I’m also getting different (unconverged) results when running on the GPU,

Unfortunately, I can’t tell from the feedback what would cause this.

Are any of the arrays used on the host within the new data region? If so, you may need to use the “update” directive to synchronize the data.

If not, can you post a code snippet (starting at line 2032)?

Generating reduction(+:resid(:))

How big is resid (in number of elements)? Personally I don’t like using array reductions unless the array is very small. Larger arrays can create a lot of overhead, so it’s often better to use atomics.
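
For a large residual array, the atomic alternative looks roughly like this (cell_bin and rcontrib are made-up names):

!$acc parallel loop private(k)
      do n = 1, ncells
         k = cell_bin(n)                    ! which residual entry this cell contributes to
!$acc atomic update
         resid(k) = resid(k) + rcontrib(n)  ! atomic add instead of reduction(+:resid)
      end do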

-Mat

Thanks Mat. I saw something about the -gpu=managed flag - I wonder if that will help?
I’m not sure how to use the update directive yet.

I think because the flow variables (all allocated) are updated within the parallel loop,
I may need to update them. Is it done after the parallel loop like this?

!$acc update self(r,u,v,w,p,t,m)

The resid array is only 5 numbers, so that should be ok.

Sure. Basically what happens is that allocations are placed in CUDA Unified Memory (UM), which provides a common address space accessible from both the host and the device, so there’s no need to use data or update directives. Under the hood the CUDA driver is still moving the data, but you don’t need to manage it. For static data, though, you still need to manage the data manually via directives.
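
A small sketch of that static-data caveat (the variable names here are invented):

      program managed_static
      implicit none
      integer, parameter :: n = 1000
      integer :: i
      real, allocatable :: a(:)      ! allocatable: handled automatically by -gpu=managed
      real :: c(5)                   ! fixed-size (static): NOT placed in Unified Memory
      allocate(a(n))
      a = 1.0
      c = 2.0
!$acc data copyin(c)                 ! so c still needs an explicit data directive
!$acc parallel loop
      do i = 1, n
         a(i) = c(mod(i-1,5)+1) * a(i)
      end do
!$acc end data
      print *, a(1), a(n)
      end program managed_static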

Is it done after the parallel loop like this?

Correct. Since you’re using Fortran allocatables, where the bounds information is part of the array itself, you don’t need to include the array sizes. For F77 assumed-size arrays or C/C++ pointers, you’d need to include the bounds info via triplet notation, i.e. “update self(r(1:N),u(1:M), … )” etc.
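
A small sketch of the difference (the subroutine and its arguments are invented for the example):

      ! F90 allocatables carry their bounds, so "update self(r)" alone is enough.
      ! For an F77 assumed-size dummy argument, the range must be spelled out:
      subroutine sync_field(r, nsize)
      integer nsize
      real r(*)                      ! assumed-size: extent unknown to the compiler
!$acc update self(r(1:nsize))        ! triplet notation gives the bounds explicitly
      end subroutine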

Now I suspect that once you either add the update directive or use managed memory, your performance will regress back to the slow time. Most likely there will be a lot of data movement between the host and device, which will slow things down.

To fix that, you’ll want to offload the rest of the code within the convergence/timestep loop that touches these arrays. Ideally the arrays are copied to the device once at the beginning of the program and the results copied back only at the end. Of course, depending on the algorithm you may need to bring some data back to the host along the way, but the fewer transfers the better.

You should also become familiar with profiling which will show how much time is being spent copying data and computing on the device.

Setting the environment variable “NV_ACC_TIME=1” will give a quick profile, but it can’t track UM, so only use it if you’re using data directives. Nsight Systems is a very nice profiler which can be run from the command line (nsys profile a.out) and then visualized via a GUI. You can get a text report as well, but the timeline in the GUI is very helpful. It’s not too hard to use, but it may take some effort to learn, so please refer to the documentation.

Hope this helps,
Mat

OK, it’s still running “fast”, but the GPU run is still not converging compared with the CPU multicore run, so I need to figure out what the difference is. The -Minfo=accel output was useful though, because it helped me spot a bug which I hadn’t noticed before and wasn’t being picked up by the compiler. So what else can I look for to spot the difference?

My compiler command is:

pgf90 -static-nvidia -O2 -i8 -acc -gpu=cc75 -o ufo-octree-tesla octree08.for

My code snippet looks like this:

!$acc data copy(nei,oc,oleaf,wigg,u,v,w,p,r,t,m)

C-----start iterations
DO it = 1,nits

!$acc parallel loop present(r,u,v,w,p,t,m)
!$acc& private(Q,E1,E2,F1,F2,G1,G2)
!$acc& private(rn,un,vn,wn,pn,tn,mn)
!$acc& reduction(+: resid)
!$acc& reduction(max: maxvel,maxwig)

C-----loop through nleaf cells
DO n = 1,nleaf

C-----
C do all the calculations which
C give rise to new values for
C allocated flow variables (r,u,v,w,p,t,m)
C-----

  ENDDO  !  leaf cells loop

  ENDDO  !  main iteration

!$acc update self(r,u,v,w,p,t,m)

!$acc end data

The output from -Minfo=accel (it may not help, but…):

flow_solve:
2032, Generating copy(m(:),nei(:,:),oc(:),w(:),wigg(:),t(:),u(:),v(:),oleaf(:),p(:),r(:)) [if not already present]
2048, Generating present(p(:),t(:),u(:),v(:),w(:),m(:),r(:))
Generating copy(resid(:)) [if not already present]
Generating implicit copy(maxwig) [if not already present]
Generating NVIDIA GPU code
2056, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Generating reduction(max:maxwig)
Generating reduction(+:resid(:))
Generating reduction(max:maxvel)
2071, !$acc loop seq
2162, !$acc loop seq
2558, !$acc loop seq
2048, Local memory used for pn,e1,e2,f1,f2,g1,resid,tn,un,wn,vn,mn,rn,g2,q
Generating implicit copy(maxvel) [if not already present]
2056, Generating implicit private(rsum,avg1,nb5,nb4,nb3,nb2,nb1,mwig,msum,zwig,ywig,xwig,mag2,dt,dotp,ccln,v2,qzz,tzz,tzy,tzx,tyz,txz,et,mm,magvel,pp,owid,tt,tsum,ww,wsum,vv,vsum,uu,usum,rr,qyy,tyy,tyx,txy,qxx,psum,txx,dtdz,dwdz,dvdz,dudz,dtdy,dwdy,dvdy,dudy,dtdx,dwdx,dvdx,dudx,dtloc,tweak,mag1,wnew,vnew,vmag,unew,diff1,cs,wnew1,vnew1,unew1,wnew2,vnew2,unew2,diff2,nb6,nrm2,nrm1,nmag,wnrm,vnrm,unrm,wold,vold,uold,srinv,avg2,olfn,nrm3,olev)
2071, Loop is parallelizable
Generating implicit private(nb)
2162, Generating implicit private(nb8,nb7,nb6,nb5,nb4,nb3,nb2,nb1,nb,cs,ma)
2302, Reference argument passing prevents parallelization:
2558, Loop is parallelizable
Generating implicit private(rhs,gflux,fflux,eflux)
2694, Generating update self(r(:),m(:),w(:),v(:),u(:),t(:),p(:))

Is there host code that uses these arrays between:

DO it = 1,nits
 ! HERE ????
!$acc parallel loop present(r,u,v,w,p,t,m)

or

  ENDDO  !  leaf cells loop
! HERE ???
  ENDDO  !  main iteration

Does it converge if you use UM (i.e. -gpu=cc75,managed)?

If so, then it’s likely a host/device data synchronization issue and you need additional update directives.

Otherwise I’d look at the nleaf loop for race conditions (i.e. places that need atomics), for anything that should be privatized, or for anything that possibly needs to be shared.

One way to check is to change “parallel loop” to “serial loop”. The kernel will be very slow since it will now run sequentially on the device, but it will indicate whether there’s a problem with the parallelization.
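
For the loop in your snippet, that’s just swapping the first directive line (the other clauses can stay as they are):

!$acc serial loop present(r,u,v,w,p,t,m)
!$acc& private(Q,E1,E2,F1,F2,G1,G2)
!$acc& private(rn,un,vn,wn,pn,tn,mn)
!$acc& reduction(+: resid)
!$acc& reduction(max: maxvel,maxwig)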

Granted, you said that it worked before and all you did was to move the data region outside of the iteration loop, so that’s why I’m still leaning towards some type of data synchronization issue.

OK, great, thanks Mat - I will investigate your comments further…

But the ‘weird’ thing is that the convergence histories for the CPU (top) and GPU (bottom) are not that different, except for rhoW (the w-velocity), which is not converging at all on the GPU. So whatever the cause of the problem is, it only appears to affect that one array (w).

There isn’t any significant code between those sections…
(I’m not sure about race conditions or how to add atomics)
…so I’m trying the “serial loop” suggestion next.

The “serial loop” just gave a load of NaNs for the convergence output (on the GPU),
whereas it seems to run OK on the CPU (with the serial loop).

How do you specify something as shared? When I tried it with these
variables (below), I got an error message (they aren’t actually shared - I’m just testing):

!$acc& shared(rhs,dt,eflux,fflux,gflux)

Also,
I noticed that my update self() directive was after the main iteration loop,
so I moved it inside the iteration loop, and it did in fact slow the computation down,
but unfortunately there is still the same problem with divergence (in the w array),
so that is frustrating.

Use one of the “copy” clauses. This will promote the scalar from being private to each thread to being shared by all the threads. Keep in mind that if several threads then write to the same shared scalar, you’d also need an atomic to avoid a race.
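
For example, to make the scalar dt shared (rather than each thread getting its own firstprivate copy), it would look something like this (ncells and du are placeholders):

!$acc parallel loop copy(dt)       ! dt: one shared device copy; its final value returns to the host
      do n = 1, ncells
         u(n) = u(n) + dt*du(n)    ! every thread reads the same dt (writes to dt would need an atomic)
      end do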

Note that if you can share the source, I can take a look. If you don’t want to do this on the public forum, feel free to direct message me by clicking on my user name and select the “message” button on my profile.

thanks I will take you up on that!

what’s the difference between your create and my private?

!$acc& create(Q,E1,E2,F1,F2,G1,G2)
!$acc& create(un,vn,wn,pn,rn,tn,mn,rhs)

vs

!$acc& private(Q,E1,E2,F1,F2,G1,G2)
!$acc& private(un,vn,wn,pn,rn,tn,mn,rhs)

“create” creates an uninitialized device array or scalar that’s accessible to, and shared between, all threads (vectors) in a kernel.

“private” creates an uninitialized device array or scalar that is only accessible to either a gang, worker, or vector. Gang private is private to the gang but shared by the workers/vectors within the gang. Worker private is private to the worker but shared by the vectors. Vector private is private to each individual vector.

For example:

!$acc parallel loop gang private(Arr)
do I=1,N
   ! Arr is gang private here
!$acc loop vector
   do J=1,M
      ! for the vectors within the gang, Arr is shared
   end do
end do

!$acc parallel loop gang
do I=1,N
!$acc loop vector private(var)
   do J=1,M
      ! each vector has its own private "var" variable
   end do
end do

!$acc parallel loop gang vector collapse(2) private(var)
do I=1,N
   do J=1,M
      ! since both "gang" and "vector" are used on this loop, the private
      ! applies at the vector level: each vector has its own copy of "var"
   end do
end do
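
And for comparison, a small sketch with “create” (A, B, and Tmp are made-up arrays): the single device array is shared by every gang and vector, and it persists between the two kernels inside the data region:

!$acc data create(Tmp) copyin(A) copyout(B)
!$acc parallel loop
do I=1,N
   Tmp(I) = 2.0*A(I)      ! all threads write into the one shared Tmp array
end do
!$acc parallel loop
do I=1,N
   B(I) = Tmp(I) + 1.0    ! values written by the first kernel are visible here
end do
!$acc end data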