Hi Ron,
Now that I’m back from China, I was able to spend a bit more time on this.
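I wrote a small reproducer and determined that the issue is a change in behavior in 18.7, where F2003 allocatable semantics were made the default. With F2003 semantics, the compiler generates a temp array (z_a_0) for the array assignment in the kernels region, and since that temp is not in the present table, the run fails. You can work around the issue by reverting to F95 semantics via the flag “-Mallocatable=95”.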
I added a TPR (#26191) to see whether we can get the compiler to optimize away the temp array, or whether the temp is required by the F2003 standard (in which case, not much can be done).
-Mat
% cat test.f90
program main
  real, dimension(:), allocatable :: rhs_cg,x_cg
  integer :: i
  allocate(rhs_cg(1024), x_cg(1024))
  !$acc enter data create(rhs_cg,x_cg)
  do i=1,1024
    rhs_cg(i) = real(i+1)
    x_cg(i) = real(i)
  enddo
  !$acc update device(rhs_cg,x_cg)
  !$acc kernels default(present)
  rhs_cg=rhs_cg-x_cg
  !$acc end kernels
  !$acc update self(rhs_cg)
  print *, rhs_cg(1:10)
  deallocate(rhs_cg,x_cg)
end program main
% pgf90 test.f90 -ta=tesla:cc70 -Minfo=accel -V18.7 -fast ; a.out
main:
8, Generating enter data create(x_cg(:),rhs_cg(:))
13, Generating update device(x_cg(:),rhs_cg(:))
15, Generating implicit present(rhs_cg(:),x_cg(1:1024),z_a_0(:))
16, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
19, Generating update self(rhs_cg(:))
hostptr=0x7ffdd7a493a0,stride=1,size=1024,eltsize=4,name=z_a_0,flags=0x200=present,async=-1,threadid=1
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1
host:0x12d9170 device:0x2ac09be00000 size:4096 presentcount:1+1 line:8 name:rhs_cg
host:0x12da1a0 device:0x2ac09be01000 size:4096 presentcount:1+1 line:8 name:x_cg
allocated block device:0x2ac09be00000 size:4096 thread:1
allocated block device:0x2ac09be01000 size:4096 thread:1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=z_a_0 host:0x7ffdd7a493a0
file: test.f90 main line:15
% pgf90 test.f90 -ta=tesla:cc70 -Minfo=accel -V18.7 -fast -Mallocatable=95 ; a.out
main:
8, Generating enter data create(rhs_cg(:),x_cg(:))
13, Generating update device(x_cg(:),rhs_cg(:))
15, Generating implicit present(x_cg(1:1024),rhs_cg(1:1024))
16, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
16, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
19, Generating update self(rhs_cg(:))
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000 1.000000 1.000000
1.000000 1.000000
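If changing the flag is not an option, another thing that may be worth trying is writing the update as an explicit loop rather than a whole-array assignment. An element-wise loop is not an allocatable array assignment, so the F2003 reallocation-on-assignment rules do not apply to it and the compiler should have no reason to create the z_a_0 temp. This is only a sketch of that idea based on the reproducer above: the file name test_loop.f90 and the parameter n are made up for illustration, and it has not been verified against 18.7.

```fortran
! test_loop.f90 (hypothetical file name): same reproducer, but the
! whole-array assignment is replaced by an explicit loop.
program main
  implicit none
  integer, parameter :: n = 1024   ! illustrative name, not in the original
  real, dimension(:), allocatable :: rhs_cg, x_cg
  integer :: i

  allocate(rhs_cg(n), x_cg(n))
  !$acc enter data create(rhs_cg,x_cg)

  ! Initialize on the host, then copy to the device.
  do i = 1, n
    rhs_cg(i) = real(i+1)
    x_cg(i) = real(i)
  enddo
  !$acc update device(rhs_cg,x_cg)

  ! Element-wise update instead of "rhs_cg=rhs_cg-x_cg". Only scalar
  ! elements are assigned here, so no allocatable array assignment occurs.
  !$acc kernels default(present)
  do i = 1, n
    rhs_cg(i) = rhs_cg(i) - x_cg(i)
  enddo
  !$acc end kernels

  !$acc update self(rhs_cg)
  print *, rhs_cg(1:10)
  deallocate(rhs_cg, x_cg)
end program main
```

It can be built with the same command line as the reproducer, without -Mallocatable=95. If the workaround is effective, the -Minfo=accel output should no longer list z_a_0 in the implicit present clause, and the program should print ten values of 1.000000 as in the second run above.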