Nondeterministic Loop

Is anyone able to explain why this loop gives me nondeterministic results when I run it twice? This is a much-reduced example from a larger application, so it doesn’t calculate meaningful results in this form. If I collapse i, j, and k and each thread gets a private flux_tmp, how can nondeterministic behavior arise here? I would gladly provide further code if necessary.

    !$acc kernels present(q,ax,flux1)
    !$acc loop collapse(3) private(flux_tmp)
    do k = ilo3, ihi3
       do j = ilo2, ihi2
          do i = ilo1+1, ihi1
             do ivar = 1, NVAR
                flux_tmp(ivar) = 0.d0
             enddo
             flux_tmp(URHO) = q(i,j,k,QRHO)
             do ivar = 1, NVAR
                flux1(i,j,k,ivar) = flux1(i,j,k,ivar) + flux_tmp(ivar) * ax(i,j,k)
             enddo
          enddo
       enddo
    enddo
    !$acc end kernels

Hi jrood,

I’m not seeing anything obvious. Can you post the compiler feedback messages (-Minfo=accel) for this loop? That might give some clues.

Another thing to try is to make this sequential (i.e., change “kernels” to “serial”) to determine whether the issue is with the parallelism or whether there’s a problem with the data movement or initial array values.

We also have a new feature called PCAST (https://www.pgroup.com/blogs/posts/pcast.htm) which can help find where divergent results begin, in case the problem is elsewhere in the code.

-Mat

Thanks for the help, Mat, and for suggesting PCAST. For porting kernels in Fortran, I have typically been using KGEN from NCAR (https://github.com/NCAR/KGen), but it sounds great to have something that runs the CPU and GPU code simultaneously for comparison.

Here is the compiler output from a more complete example:

pc_hyp_mol_flux:
    164, Generating update device(nvar)
    165, Generating enter data copyin(flux3(:,:,:,:),hi(:),lo(:),v(:,:,:),q(:,:,:,:),flux2(:,:,:,:),flux1(:,:,:,:))
         Generating enter data create(d(:,:,:,:))
         Generating enter data copyin(ax(:,:,:))
    167, Accelerator kernel generated
         Generating Tesla code
        168, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
        169,   ! blockidx%x threadidx%x collapsed
        170,   ! blockidx%x threadidx%x collapsed
        171, !$acc loop seq
        175, !$acc loop seq
    167, Generating implicit present(ax(ilo1+1:ihi1,ilo2:ihi2,ilo3:ihi3),q(ilo1+1:ihi1,ilo2:ihi2,ilo3:ihi3,:1),flux1(ilo1+1:ihi1,ilo2:ihi2,ilo3:ihi3,:nvar))
    171, Loop is parallelizable
    175, Loop is parallelizable
    183, Accelerator kernel generated
         Generating Tesla code
        184, !$acc loop gang, vector(128) collapse(4) ! blockidx%x threadidx%x
        185,   ! blockidx%x threadidx%x collapsed
        186,   ! blockidx%x threadidx%x collapsed
        187,   ! blockidx%x threadidx%x collapsed
    183, Generating implicit present(v(lo-2:hi+2,lo-2:hi+2,lo-2:hi+2),flux3(lo-2:hi+2,lo-2:hi+2,lo-2:hi+3,:nvar),d(lo-2:hi+2,lo-2:hi+2,lo-2:hi+2,:nvar),flux1(lo-2:hi+3,lo-2:hi+2,lo-2:hi+2,:nvar),flux2(lo-2:hi+2,lo-2:hi+3,lo-2:hi+2,:nvar))
    197, Generating exit data copyout(flux1(:,:,:,:))
         Generating exit data delete(hi(:),lo(:),v(:,:,:),q(:,:,:,:),ax(:,:,:))
         Generating exit data copyout(d(:,:,:,:))

And here is the more complete example as well, with a few line numbers for reference. I’m really at a loss for what I’m missing with this one:

    nextra = 3
    ilo1=lo(1)-nextra
    ilo2=lo(2)-nextra
    ilo3=lo(3)-nextra
    ihi1=hi(1)+nextra
    ihi2=hi(2)+nextra
    ihi3=hi(3)+nextra

    do L=1,3
       qt_lo(L) = lo(L) - nextra
       qt_hi(L) = hi(L) + nextra
    enddo

    !$acc update device(nvar)
    !$acc enter data create(d) copyin(flux1,flux2,flux3) copyin(v,ax,q,lo,hi)

    !$acc parallel loop gang vector collapse(3) private(flux_tmp) default(present)
    do k = ilo3, ihi3 !line 168
       do j = ilo2, ihi2
          do i = ilo1+1, ihi1
             do ivar = 1, NVAR
                flux_tmp(ivar) = 0.d0
             enddo
             flux_tmp(1) = q(i,j,k,1)
             do ivar = 1, NVAR
                flux1(i,j,k,ivar) = flux1(i,j,k,ivar) + flux_tmp(ivar) * ax(i,j,k)
             enddo
          enddo
       enddo
    enddo !line 180
    !$acc end parallel
 
    !$acc parallel loop gang vector collapse(4) default(present)
    do ivar=1,NVAR
       do k = lo(3)-nextra+1, hi(3)+nextra-1
          do j = lo(2)-nextra+1, hi(2)+nextra-1
             do i = lo(1)-nextra+1, hi(1)+nextra-1
                d(i,j,k,ivar) = - (flux1(i+1,j,k,ivar) - flux1(i,j,k,ivar) &
                                +  flux2(i,j+1,k,ivar) - flux2(i,j,k,ivar) &
                                +  flux3(i,j,k+1,ivar) - flux3(i,j,k,ivar)) / v(i,j,k)
             enddo
          enddo
       enddo
    enddo
    !$acc end parallel

    !$acc exit data copyout(flux1,d) delete(v,ax,q,lo,hi)

So if I comment out the original loop using flux_tmp, it’s deterministic. Maybe it’s fine that it’s nondeterministic and the order of operations can get scheduled differently. I haven’t been able to verify whether my results are good or not. I have many other ACC loops in other parts of the code that don’t exhibit nondeterministic behavior, though, which is why this concerns me.

“Maybe it’s fine that it’s nondeterministic and the order of operations can get scheduled differently.”

Except the ivar loops are being run sequentially, and flux1 is indexed by all four loop indices, so each element is updated by exactly one thread and there really isn’t an order-of-operations issue. The accumulation should be done in the same order each time.

I’m still not seeing anything wrong with this and wondering if it’s a data synchronization issue.

You update “nvar” which implies that there’s a data region at a higher level. Are any of the variables in the “enter/exit data” also managed in another data region higher up?

If so, you’ll want to use the “update” directive instead. The default is to use “present_or” semantics on copy clauses. So if you have a “copyin” on a variable that’s already present, the “copyin” is ignored and the data isn’t updated on the device. Only the reference counter is incremented on enter and decremented on exit. Not until there are no more references, or the “finalize” clause is used, will the device data be deleted.
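As a sketch of how the reference counting behaves (using a hypothetical array “a”, not from the code above):

```fortran
!$acc enter data copyin(a)    ! "a" is copied to the device; ref count = 1

call modify_a_on_host()       ! host copy of "a" changes; device copy does not

!$acc enter data copyin(a)    ! "a" is already present: this copyin is IGNORED,
                              ! only the ref count is incremented to 2
!$acc update device(a)        ! an explicit update is needed to push the new
                              ! host values to the device

!$acc exit data delete(a)     ! ref count drops to 1; device copy is kept
!$acc exit data copyout(a)    ! ref count drops to 0; "a" is copied back
                              ! to the host and the device copy is deleted
```

This is why a stale device copy can silently survive an inner “copyin” when an outer data region already holds a reference.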

You can check this quickly by adding the following update directives:

    !$acc update device(nvar) 
    !$acc enter data create(d) copyin(flux1,flux2,flux3) copyin(v,ax,q,lo,hi) 
    !$acc update device(flux1,flux2,flux3,ax,q,lo,hi)
    !$acc parallel loop gang vector collapse(3) private(flux_tmp) default(present) 
...
    !$acc end parallel 
    !$acc update self(flux1,d)
    !$acc exit data copyout(flux1,d) delete(v,ax,q,lo,hi)

If this fixes the problem, then you can adjust these data regions to only include the variables not in a higher-level data region, and update the ones that are. In particular, I’d be looking at “ax”, “q”, and “flux1”.

Note that you shouldn’t need to manage “nvar”. It’s a scalar, so it will default to using “firstprivate”. It shouldn’t be causing this issue, but dropping the management helps performance (albeit not by much), since managing it makes it a global reference that must be fetched from device memory, as opposed to a local variable passed in as a kernel argument (which is what “firstprivate” does).
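In other words, a minimal sketch (with a hypothetical loop, not from the code above):

```fortran
! With no data clause for the scalar "nvar", the compiler passes its current
! host value into the kernel as an argument (firstprivate semantics), so no
! "update device(nvar)" is needed before the launch.
!$acc parallel loop default(present)
do i = 1, n
   b(i) = a(i) * nvar   ! nvar: a private local copy per kernel launch
enddo
!$acc end parallel
```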

If this still doesn’t solve the problem, can you share the code? If so, please send a note with a reproducer to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

It’s possible that there’s a compiler bug, but this code looks fairly common so I would think that we would have encountered something similar before if it was a bug. Still, it would be good for me to investigate just in case.

-Mat

Mat, thanks again for the help. I tried adding in the update directives, and when I run the program twice, it still diffs. The code I am working on is open source, so I will try to put together a branch with this example and steps to reproduce it. I fear I’m missing something simple, but I’ve spent a few days trying to solve this and I just don’t understand it.

There are definitely Fortran module variables I deal with, which makes it troublesome, but I’ve tried hardcoding most of them out as well. I use unstructured data directives, but they are exactly as you see them: for each kernel I place them before and after, so that later I can try to optimize down to a single data directive before and after all the kernels.

Here is what I just tried:

    nextra = 3
    ilo1=lo(1)-nextra
    ilo2=lo(2)-nextra
    ilo3=lo(3)-nextra
    ihi1=hi(1)+nextra
    ihi2=hi(2)+nextra
    ihi3=hi(3)+nextra

    do L=1,3
       qt_lo(L) = lo(L) - nextra
       qt_hi(L) = hi(L) + nextra
    enddo

    !$acc update device(nvar)
    !$acc enter data create(d) copyin(flux1,flux2,flux3) copyin(v,ax,q,lo,hi)
    !$acc update device(flux1,flux2,flux3,ax,q,lo,hi)
    !$acc parallel loop gang vector collapse(3) private(flux_tmp) default(present)
    do k = ilo3, ihi3
       do j = ilo2, ihi2
          do i = ilo1+1, ihi1
             do ivar = 1, NVAR
                flux_tmp(ivar) = 0.d0
             enddo
             flux_tmp(1) = q(i,j,k,1)
             do ivar = 1, NVAR
                flux1(i,j,k,ivar) = flux1(i,j,k,ivar) + flux_tmp(ivar) * ax(i,j,k)
             enddo
          enddo
       enddo
    enddo
    !$acc end parallel
 
    !$acc parallel loop gang vector collapse(4) default(present)
    do ivar=1,NVAR
       do k = lo(3)-nextra+1, hi(3)+nextra-1
          do j = lo(2)-nextra+1, hi(2)+nextra-1
             do i = lo(1)-nextra+1, hi(1)+nextra-1
                d(i,j,k,ivar) = - (flux1(i+1,j,k,ivar) - flux1(i,j,k,ivar) &
                                +  flux2(i,j+1,k,ivar) - flux2(i,j,k,ivar) &
                                +  flux3(i,j,k+1,ivar) - flux3(i,j,k,ivar)) / v(i,j,k)
             enddo
          enddo
       enddo
    enddo
    !$acc end parallel
    !$acc update self(flux1,d)
    !$acc exit data copyout(flux1,d) delete(v,ax,q,lo,hi)

Note that commenting out the first ACC loop calculating flux1 stops the two runs of the application from diffing.

Hi jrood,

Thanks for sending a reproducing example to PGI Customer Service. I was able to determine the problem is caused by a compiler error where it’s not properly privatizing the “flux_tmp” array, since “flux_tmp” is an automatic. The workaround would be to make “flux_tmp” fixed size.
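For example, assuming “flux_tmp” is currently declared as an automatic array sized by the runtime variable nvar, the workaround would look something like this (“NVARMAX” is a hypothetical compile-time upper bound, adjust to the real maximum):

```fortran
! Before (automatic array -- triggers the privatization error):
!   double precision :: flux_tmp(nvar)

! After (fixed-size array, sized by a compile-time constant):
integer, parameter :: NVARMAX = 16
double precision   :: flux_tmp(NVARMAX)
```

With a compile-time extent, the compiler can allocate a properly private copy of the array for each thread.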

I’ve added TPR #27040 and sent it to our engineers for further investigation.

Best Regards,
Mat

TPR 27040 is fixed in the NVIDIA HPC SDK compilers starting with release 20.5.