Out-of-bound shared memory access

I am developing a multi-dimension transpose kernel in OpenACC. The relevant section looks like this:

real*8, dimension(0:tile_size-1) :: s
integer, dimension(0:tile_size-1) :: is, id
integer :: ia,ib,id_b,it_b,it_a,im_b,im_a,ic,i,ib0,id_l,is_l

!$acc parallel present(data_in,data_out,shape_a,shape_b,stride_a_l,stride_a_g,stride_b)
!$acc loop independent gang private(s,is,id,ia,it_b,im_b,id_b,ib0)
do ib=0,vol_b-1

  !$acc cache(s,is,id)                                                                 

  ib0 = ib*vol_a
  it_b=ib; id_b=0
  !$acc loop seq
  do i=0,dim_b-1
    im_b = it_b/shape_b(i)
    id_b = id_b + stride_b(i)*(it_b-im_b*shape_b(i))
    it_b = im_b
  end do

  !$acc loop vector private(it_a,id_l,is_l)
  do ia=0,vol_a-1
    it_a=ia; id_l=id_b; is_l=0
    !$acc loop seq
    do i=0,dim_a-1
      im_a = it_a/shape_a(i)
      ic = it_a - im_a*shape_a(i)
      id_l = id_l + stride_a_g(i)*ic
      is_l = is_l + stride_a_l(i)*ic
      it_a = im_a
    end do
    id(ia) = id_l
    is(ia) = is_l
  end do

  !$acc loop vector
  do ia=0,vol_a-1
  end do

  !$acc loop vector
  do ia=0,vol_a-1
    data_out(id(ia)) = s(is(ia))
  end do

end do
!$acc end parallel

Basically I pre-calculate all the indices in the write loop. Note that I also have a version which does index calculation inside the write loop (so there is only one shared-memory array in the kernel), which works well on GPU. I ran both version on CPU and they generate exactly the same index sequence, so the logic should be correct. However, when I run this version with pre-calculated indices I would get

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

If I use cuda-memcheck I would find something like this:

========= Invalid __shared__ read of size 8
=========     at 0x000008b0 in f_tt_86_gpu
=========     by thread (31,0,0) in block (111,0,0)
=========     Address 0x5f952c18 is out of bounds

The OFB is due to invalid shared memory access in the write loop. I tend to suspect that GPU code generation fails when multiple shared-memory arrays are involved.

Without pre-calculating the indices, the performance can be 2X worse, so I would love to see a solution.

Thank you!

Hi rikisyo,

Unfortunately, I’m not sure what’s wrong and I don’t see any issues with your code. My best guess is that it’s compiler issue having to do with the cache directive.

Does the code work if you comment out the cache directive?
Also, what compiler version are you using? We have made improvements in the cache directive so if you’re using an older version, please try our latest version (currently 16.10).

If using 16.10 does not fix the problem, can you please send a reproducing example (or the whole project) to PGI Customer Service (trs@pgroup.com)? We’d like to diagnose the problem and if it is a compiler error, get the problem fixed.


Mat thanks for your quick response!

Yes the code works perfectly if I disable the cache directive, or removing the “safecache” option in -ta compiler argument, which effectively disables the cache directive as the arrays are automatic.

I am afraid I don’t have the option to upgrade to 16.10. I will file a report to PGI CS.

This is now a confirmed bug.

As a workaround, converting tile_size to a constant would allow the code to work properly.

Hi Mat,

will you release a 16.10 version or is this a typo. On your homepage the latest version seems to be 16.9.


I meant 16.10 but didn’t realize 16.10 isn’t available for download as of yet. It should be posted very soon.

  • Mat

Actually this bug has been confirmed by PGI Premier Support on 16.10.

TPR 23270 - Caching works when size is a parameter, not when passed in.

is fixed in the current 17.1 release.