Code crashing with 18.7 - worked with 18.4

Hi,
I am trying my code on PGI 18.7.
The code compiles but with some serialization issues that I have posted in other threads.
A much more severe problem is that the code crashes now on a run that worked when compiling with 18.4.
The crash is:

FATAL ERROR: data in PRESENT clause was not found on device 1: name=z_a_1 host:0x7fcb0f1c2190
 file:mas_sed_expmac.f pot3d_solver line:27394

where the offending line of code is:

!$acc kernels default(present)
      rhs_cg=rhs_cg-x_cg
!$acc end kernels

I do not understand this, since there is no array called z_a_1 in my code.
The arrays in that line of code are allocated and initialized by

      allocate(x_cg(N_cgvec))
      allocate(rhs_cg(N_cgvec))
!$acc enter data create(x_cg,rhs_cg)
!$acc kernels default(present)
      x_cg=0.
      rhs_cg=0.
!$acc end kernels

This is one of the initializations that the compiler now serializes, as shown in another thread.
Any ideas?
Many other parts of the code that are similar to this still seem to work, adding to my confusion.

  • Ron

Hi Ron,

Per the Fortran standard, when using array syntax the right-hand side of the assignment must be fully evaluated before it is assigned to the left-hand side. This means that the compiler must, in general, create a temp array to store the result before the assignment. “z_a_1” is a compiler-generated temp array. Sometimes the compiler can optimize away the temp array.
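
To illustrate the evaluation rule, here is a minimal sketch that is not taken from the code in this thread (the program and variable names are made up for the example). It uses an overlapping shift, where the temp array actually matters, and compares array syntax against a naive element-by-element loop:

```fortran
program rhs_before_lhs
  implicit none
  integer :: a(5), b(5), i

  a = [1, 2, 3, 4, 5]
  b = a

  ! Array syntax: the right-hand side a(1:4) is evaluated in full
  ! before any element of a(2:5) is overwritten.
  a(2:5) = a(1:4)

  ! Naive forward loop with no temporary: each iteration reads a
  ! value that the previous iteration has already overwritten.
  do i = 2, 5
     b(i) = b(i-1)
  end do

  print '(a,5i2)', 'array syntax :', a   ! prints 1 1 2 3 4
  print '(a,5i2)', 'explicit loop:', b   ! prints 1 1 1 1 1
end program rhs_before_lhs
```

In an assignment like rhs_cg=rhs_cg-x_cg, each element depends only on the same index, so there is no such hazard and the temp array can in principle be eliminated.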

I’m not sure what changed between 18.4 and 18.7. Either we’re no longer optimizing away the temp array, or 18.4 didn’t detect that the temp array wasn’t on the device and so didn’t throw an error with “default(present)”.

The easiest thing to do is make this an explicit loop rather than use array syntax so you don’t need to worry about the temp array. Something like:

!$acc kernels default(present) 
      do i = 1, N_cgvec
          rhs_cg(i)=rhs_cg(i)-x_cg(i)
      enddo
!$acc end kernels
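
For completeness, a self-contained sketch of the same workaround as a full program. The program name, the parameter n, and the initial values are assumptions chosen for the example; they are not from the original code. Without OpenACC enabled, the directives are treated as comments and the program runs on the host:

```fortran
program explicit_loop
  implicit none
  integer, parameter :: n = 1024        ! hypothetical size for the example
  real, allocatable :: rhs_cg(:), x_cg(:)
  integer :: i

  allocate(rhs_cg(n), x_cg(n))
  !$acc enter data create(rhs_cg, x_cg)

  ! Initialize on the host, then copy the values to the device.
  do i = 1, n
     rhs_cg(i) = real(i+1)
     x_cg(i)   = real(i)
  end do
  !$acc update device(rhs_cg, x_cg)

  ! Explicit loop instead of array syntax: no temp array is involved,
  ! so default(present) only has to find rhs_cg and x_cg.
  !$acc kernels default(present)
  do i = 1, n
     rhs_cg(i) = rhs_cg(i) - x_cg(i)
  end do
  !$acc end kernels

  !$acc update self(rhs_cg)
  print *, rhs_cg(1:10)                 ! each value should be 1.0

  !$acc exit data delete(rhs_cg, x_cg)
  deallocate(rhs_cg, x_cg)
end program explicit_loop
```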

Hope this helps,
Mat

Hi,
Writing the explicit loop allows the code to run. Indeed, other similar loops had already been explicitly written out for optimization purposes.

However, I am confused about something. Even if the compiler needs to create a temporary array, the code is in a compute region, so shouldn’t the compiler create the temporary array on the device, so that the “default(present)” clause still sees it?
If so, then this would be a new bug since the code worked as-is previously.

Thanks for the quick responses!

  • Ron

Hi Ron,

Now that I’m back from China, I was able to spend a bit more time on this.

I wrote a small reproducer and determined that the issue is a change in behavior in 18.7, where F2003 allocatable semantics were made the default. You can work around the issue by reverting to F95 semantics via the flag “-Mallocatable=95”.
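
As background on what F2003 allocatable semantics mean, here is a minimal host-only sketch (program and variable names are made up for the example). Under F2003, assigning to a whole allocatable array reallocates the left-hand side when the shapes differ. An array section on the left-hand side is not an allocatable variable, so it is never reallocated:

```fortran
program realloc_on_assign
  implicit none
  real, allocatable :: a(:)
  real :: b(5)

  b = 1.0
  allocate(a(3))
  a = 0.0

  ! Whole-array assignment to an allocatable: under F2003 semantics,
  ! a is reallocated to the shape of b because the shapes differ.
  ! (Under F95 semantics this mismatched assignment is not valid.)
  a = b
  print *, 'size(a) after a = b:', size(a)   ! 5 under F2003 semantics

  ! Section assignment: a(:) must already conform to the right-hand
  ! side; no reallocation is performed.
  a(:) = 2.0 * b
  print *, 'size(a) after a(:) = 2*b:', size(a)
end program realloc_on_assign
```

Whether writing the left-hand side as a section also avoids the temp array in 18.7 is not something tested in this thread; the explicit loop and “-Mallocatable=95” are the workarounds confirmed here.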

I added a TPR (#26191) to see whether we can get the compiler to optimize away the temp array, or whether the temp array is required by the F2003 standard (in which case, not much can be done).

-Mat

% cat test.f90

program main

real, dimension(:), allocatable :: rhs_cg,x_cg
integer :: i

allocate(rhs_cg(1024), x_cg(1024))
!$acc enter data create(rhs_cg,x_cg)
do i=1,1024
   rhs_cg(i) = real(i+1)
   x_cg(i) = real(i)
enddo
!$acc update device(rhs_cg,x_cg)

!$acc kernels default(present)
rhs_cg=rhs_cg-x_cg
!$acc end kernels

!$acc update self(rhs_cg)
print *, rhs_cg(1:10)

deallocate(rhs_cg,x_cg)
end program main

% pgf90 test.f90 -ta=tesla:cc70 -Minfo=accel -V18.7 -fast ; a.out
main:
      8, Generating enter data create(x_cg(:),rhs_cg(:))
     13, Generating update device(x_cg(:),rhs_cg(:))
     15, Generating implicit present(rhs_cg(:),x_cg(1:1024),z_a_0(:))
     16, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     19, Generating update self(rhs_cg(:))
hostptr=0x7ffdd7a493a0,stride=1,size=1024,eltsize=4,name=z_a_0,flags=0x200=present,async=-1,threadid=1
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x12d9170 device:0x2ac09be00000 size:4096 presentcount:1+1 line:8 name:rhs_cg
host:0x12da1a0 device:0x2ac09be01000 size:4096 presentcount:1+1 line:8 name:x_cg
allocated block device:0x2ac09be00000 size:4096 thread:1
allocated block device:0x2ac09be01000 size:4096 thread:1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=z_a_0 host:0x7ffdd7a493a0
 file: test.f90 main line:15

% pgf90 test.f90 -ta=tesla:cc70 -Minfo=accel -V18.7 -fast -Mallocatable=95 ; a.out
main:
      8, Generating enter data create(rhs_cg(:),x_cg(:))
     13, Generating update device(x_cg(:),rhs_cg(:))
     15, Generating implicit present(x_cg(1:1024),rhs_cg(1:1024))
     16, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     19, Generating update self(rhs_cg(:))
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000        1.000000        1.000000
    1.000000        1.000000

Even if the F2003 semantics are kept and the temporary array is not optimized out, shouldn’t the compiler place that temporary array in device memory, since it is in a compute region? Wouldn’t that solve the issue?

That’s the question I’m asking our engineers in TPR#26191.

Just for reference, this issue should be resolved in release 19.1.