Unexpected crash when using `!$acc cache` directive

Dear all,

When trying to test the use of shared memory (using NVHPC 22.7) on my old P2000 GPU, I got an error. The simplified program below results in out-of-bounds errors when compiled and run as follows:

nvfortran -acc test_shared.f90 && compute-sanitizer ./a.out

This is the program

program test_shared
  implicit none
  integer , parameter ::  nx = 32, ny = 32
  real(8), allocatable, dimension(:,:) :: a
  real(8), allocatable, dimension(:,:) :: b
  integer :: i,j
  real(8) :: aip,aim,ajp,ajm
  allocate(a(0:nx+1,0:ny+1))
  allocate(b(0:nx+1,0:ny+1))
  a(:,:) = 1.
  !$acc enter data copyin(a) create(b)
  !$acc parallel loop collapse(2) default(present) private(aip,aim,ajp,ajm)
    do j=1,ny
      do i=1,nx
        !$acc cache(a(i-1:i+1,j-1:j+1))
        aip  = a(i+1,j)+a(i,j)
        aim  = a(i-1,j)+a(i,j)
        ajp  = a(i,j+1)+a(i,j)
        ajm  = a(i,j-1)+a(i,j)
        b(i,j) = aip + aim + ajp + ajm
      end do
    end do
  !$acc update self(b)
  print*,'b(10,10) = ', b(10,10)
end

while the code runs fine if I remove the cache directive. If I don’t use compute-sanitizer, the code will crash on my P2000 GPU for sufficiently large array sizes, which unfortunately is the case in my actual problem :(.

Thanks in advance!

Hi p.simoes,costa,

In general, we don’t recommend using the “cache” directive. Besides being tricky to use, it turns out to be very difficult to implement a low-level concept such as shared-memory caching in a high-level, directive-based model.

Here you have the cache directive at the vector level, meaning each thread will load this memory into its own local memory, which won’t help performance. It’s basically no different from fetching the memory in the body of the loop. Plus there’s no re-use, and even if there were, the hardware-managed caching is quite good, so there’s no need to add software-managed caching.

If the intent was to bring a block of memory into shared memory for use by multiple threads, you’d need to move the cache directive to the gang level and give it the full range of elements to cache (below). Given the array extent is only 32, it should fit in shared memory, but if it’s larger, it may not.

  !$acc enter data copyin(a) create(b)
  !$acc parallel loop gang default(present)
    do j=1,ny
      !$acc cache(a(0:nx+1,j-1:j+1))
      !$acc loop vector private(aip,aim,ajp,ajm)
      do i=1,nx
        aip  = a(i+1,j)+a(i,j)
        aim  = a(i-1,j)+a(i,j)
        ajp  = a(i,j+1)+a(i,j)
        ajm  = a(i,j-1)+a(i,j)
        b(i,j) = aip + aim + ajp + ajm
      end do
    end do

The very old Kepler devices did rely on software-managed caching to get the best performance, but with Pascal and newer architectures, hardware-managed caching is quite good, mitigating the need for it. Not that it can’t still help, it can, but you’d often need to use CUDA to express the shared-memory caching effectively.

-Mat


Thank you very much, Mat. This helps a lot.

Pedro