Hi,
This problem happens in various places in my code, but I will report the simplest case, a routine with two loops:
!$acc kernels present(t)
DO j = 1, je
  DO i = 1, ie
    tt_lheat(i,j,kup:klow,nnew) = tt_lheat(i,j,kup:klow,nnew) &
                                - t(i,j,kup:klow,nnew)
  ENDDO
ENDDO
!$acc end kernels
This is what the compiler generates:
99, Generating present(t(:,:,:,:))
Generating copy(tt_lheat(1:ie,1:je,kup:klow,nnew))
Generating local(t(1:ie,1:je,kup:klow,nnew))
Generating compute capability 2.0 binary
100, Loop is parallelizable
101, Loop is parallelizable
102, Loop is parallelizable
Accelerator kernel generated
100, !$acc loop gang, vector(4) ! blockidx%y threadidx%z
101, !$acc loop gang, vector(4) ! blockidx%x threadidx%y
102, !$acc loop vector(16) ! threadidx%x
CC 2.0 : 21 registers; 8 shared, 96 constant, 0 local memory bytes; 83% occupancy
Clearly local(t) is not necessary, and it is actually a problem because I get an error at run time: it seems that the compiler generates a free for what it considers the local array (this happens in another subroutine):
pgi_acc_dataoff(devptr=0x203ae01e4,hostptr=0x22de2b0,offset=7,stride=1,size=28,extent=36,eltsize=4,lineno=2375,name=t$sd,flags=0x700=create+present+copyin)
unmap dev:0x203ae0200 host:0x22de2b0 size:112 offset:28 data[dev:0x203ae0200 host:0x22de2b0 size:112] (line:2368 name:t$sd)
__pgi_cu_free( 0x203ae0200, lineno=2375, name=t$sd )
call to cuMemFree returned error 700: Launch failed
CUDA driver version: 4020
With version 12.3, the local(t) becomes a copy of the same subarray of t.
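For reference, the array-section assignment at line 102 is an implicit loop over k, which is why the compiler reports three parallelizable loops (100, 101, 102). Below is a minimal, self-contained sketch of the same update with the k loop written out. The array sizes, the values of kup, klow and nnew, and the copy/copyin clauses are assumptions made only so the example runs on its own (in the real code t is already present on the device). Whether this form avoids the spurious local(t) is not shown by the output above and would need to be checked with -Minfo.

```fortran
! Sketch only: sizes, bounds and data clauses are assumptions for a standalone test.
program explicit_k_loop
  implicit none
  integer, parameter :: ie = 8, je = 6, ke = 10, nt = 2   ! hypothetical sizes
  integer, parameter :: kup = 3, klow = 7, nnew = 2       ! hypothetical bounds
  real :: t(ie,je,ke,nt), tt_lheat(ie,je,ke,nt), ref(ie,je,ke,nt)
  integer :: i, j, k

  call random_number(t)
  call random_number(tt_lheat)

  ! Reference result, computed on the host with the original array-section form.
  ref = tt_lheat
  ref(:,:,kup:klow,nnew) = ref(:,:,kup:klow,nnew) - t(:,:,kup:klow,nnew)

  ! Same update with the k loop written out. copy/copyin replace present(t)
  ! here only because this standalone program has no enclosing data region.
  !$acc kernels copy(tt_lheat) copyin(t)
  DO j = 1, je
    DO i = 1, ie
      DO k = kup, klow
        tt_lheat(i,j,k,nnew) = tt_lheat(i,j,k,nnew) - t(i,j,k,nnew)
      ENDDO
    ENDDO
  ENDDO
  !$acc end kernels

  ! Compare against the host reference with a small tolerance.
  IF (maxval(abs(tt_lheat - ref)) > 1.0e-6) THEN
    PRINT *, 'mismatch between explicit loop and array-section form'
    STOP 1
  ENDIF
  PRINT *, 'explicit k loop matches array-section form'
end program explicit_k_loop
```

Compiled without accelerator support, the directives are treated as comments and the program still checks that the two forms give the same result.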
Best Regards
Tiziano