Is it possible to manage shared memory explicitly? There is a cache clause that can be used as a hint to the compiler. I have the following acc region. The kreal and kimag values of a particular point are stored in separate arrays (ktempreal, ktempimag) and reused in all nterms iterations. The idea is to place these arrays in shared memory.
!$acc region
!$acc do kernel private (ktempreal, ktempimag) cache(ktempreal,ktempimag)
DO p = 1,npoints
   ! store k values in ktemp array
   DO r = 1,order+1
      DO mu = 0,NDIR-1
         ktempreal(mu,r) = kreal(mu,r,p)
         ktempimag(mu,r) = kimag(mu,r,p)
      END DO
   END DO
   ! use the ktemp values for all nterms iterations
   DO i = 1,nterms
      phase3real = 0
      phase3imag = 0
      DO r = 1,order+1
         DO mu = 0,NDIR-1
            phase3real = phase3real - ktempimag(mu,r) * yxv(mu,r,i)
            phase3imag = phase3imag + ktempreal(mu,r) * yxv(mu,r,i)
         END DO
      END DO
      vtxgpureal(p) = vtxgpureal(p) + phase3real
      vtxgpuimag(p) = vtxgpuimag(p) + phase3imag
   END DO
END DO
!$acc end region
The compiler-generated messages are as follows:
60, Generating copyin(kimag(0:ndir-1,1:order+1,1:npoints))
    Generating copyin(kreal(0:ndir-1,1:order+1,1:npoints))
    Generating copy(vtxgpuimag(0,0,0,1:npoints))
    Generating copy(vtxgpureal(0,0,0,1:npoints))
    Generating copyin(yxv(0:ndir-1,1:order+1,1:nterms))
    Generating compute capability 2.0 binary
62, Loop is parallelizable
    Accelerator kernel generated
    62, !$acc do parallel, vector(32)
        Non-stride-1 accesses for array 'vtxgpuimag'
        Non-stride-1 accesses for array 'vtxgpureal'
        CC 2.0 : 30 registers; 4 shared, 152 constant, 0 local memory bytes; 16 occupancy
63, Loop is parallelizable
64, Loop is parallelizable
69, Complex loop carried dependence of 'vtxgpureal' prevents parallelization
    Loop carried dependence of 'vtxgpureal' prevents parallelization
    Loop carried backward dependence of 'vtxgpureal' prevents vectorization
    Complex loop carried dependence of 'vtxgpuimag' prevents parallelization
    Loop carried dependence of 'vtxgpuimag' prevents parallelization
    Loop carried backward dependence of 'vtxgpuimag' prevents vectorization
72, Loop is parallelizable
73, Loop is parallelizable
Please advise.