Hi,
My code implements a 26-point isotropic stencil, and I want to prefetch some values into GPU shared memory, typically i-1:i+4 and j-1:j+4. I have not been able to do this; the compiler gives a warning such as:
PGF90-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): multiple indices in shared memory dimension (kernel.f90: 416)
or
PGF90-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): unknown shared array size (kernel.f90: 617)
when I leave out one dimension.
My code structure is as follows:
!$ACC KERNELS &
!$ACC PRESENT(p0,q0,phi,eta,roc2)
!$ACC LOOP INDEPENDENT
do k=k0,k1
  !$ACC LOOP INDEPENDENT
  do j=j0,j1
    !$ACC LOOP INDEPENDENT
    do i=i0,i1
      !$ACC CACHE(p0(...),q0(...))
      ! (stencil update omitted)
    end do
  end do
end do
!$ACC END KERNELS
Perhaps it is a bad idea to have the cache directive execute so many times. I would like some idea of how it could be used efficiently in my case.
Thank you very much,
Sayan
UPDATE:
This works when I specify the cache directive as:
!$ACC CACHE(p0(i-1:i+4,j-1:j+4,k-1:k+1), q0(i-1:i+4,j-1:j+4,k-1:k+1))
Compilation info:
423, Cached references to size [(x+5)x6x(y+2)] block of 'q0'
     Cached references to size [(x+5)x6x(y+2)] block of 'p0'
Now I get this warning instead:
PGF90-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): illegal opcode (kernel.f90: 416)
The code structure is the same as above. The remaining problem is that the code is terribly slow. I think I need to change the loop mapping; any ideas are welcome.
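A minimal, self-contained sketch of this kind of loop nest is below, for experimenting with the cache directive placement and the loop mapping. It is not the actual kernel: the program name, the array size n, the loop bounds, the 7-point update standing in for the real 26-point stencil, the symmetric cache bounds, and the GANG / VECTOR(128) mapping are all assumptions chosen for illustration. Whether this mapping is faster, or avoids the "illegal opcode" warning, has not been tested.

```fortran
program cache_sketch
  implicit none
  ! Hypothetical problem size and loop bounds; the real values are not given in the post.
  integer, parameter :: n = 32
  integer, parameter :: i0 = 2, i1 = n-1
  integer, parameter :: j0 = 2, j1 = n-1
  integer, parameter :: k0 = 2, k1 = n-1
  real, allocatable :: p0(:,:,:), q0(:,:,:)
  integer :: i, j, k

  allocate(p0(n,n,n), q0(n,n,n))
  p0 = 1.0
  q0 = 0.0

  !$ACC DATA COPYIN(p0) COPY(q0)
  !$ACC KERNELS PRESENT(p0,q0)
  ! Assumed mapping: gangs over the outermost loop, vector lanes over the
  ! innermost loop. i is the contiguous (stride-1) index in Fortran.
  !$ACC LOOP INDEPENDENT GANG
  do k = k0, k1
    !$ACC LOOP INDEPENDENT
    do j = j0, j1
      !$ACC LOOP INDEPENDENT VECTOR(128)
      do i = i0, i1
        ! The cache directive sits at the top of the loop body. Its bounds
        ! cover only what this simplified update reads.
        !$ACC CACHE(p0(i-1:i+1,j-1:j+1,k-1:k+1))
        ! Simplified 7-point update, standing in for the 26-point stencil.
        q0(i,j,k) = p0(i-1,j,k) + p0(i+1,j,k) &
                  + p0(i,j-1,k) + p0(i,j+1,k) &
                  + p0(i,j,k-1) + p0(i,j,k+1) &
                  - 6.0*p0(i,j,k)
      end do
    end do
  end do
  !$ACC END KERNELS
  !$ACC END DATA

  ! With a constant p0 the update sums to zero at every interior point.
  print *, 'max |q0| =', maxval(abs(q0))

  deallocate(p0, q0)
end program cache_sketch
```

Without OpenACC enabled the directives are treated as comments, so the program also builds and runs on the host, which makes it easy to compare results. With the PGI compiler, the -Minfo messages should report which loops were mapped to gangs and vectors and what was cached.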