I am developing a multi-dimension transpose kernel in OpenACC. The relevant section looks like this:
real*8, dimension(0:tile_size-1) :: s
integer, dimension(0:tile_size-1) :: is, id
integer :: ia,ib,id_b,it_b,it_a,im_b,im_a,ic,i,ib0,id_l,is_l
!$acc parallel present(data_in,data_out,shape_a,shape_b,stride_a_l,stride_a_g,stride_b)
!$acc loop independent gang private(s,is,id,ia,it_b,im_b,id_b,ib0)
do ib=0,vol_b-1
!$acc cache(s,is,id)
ib0 = ib*vol_a
it_b=ib; id_b=0
!$acc loop seq
do i=0,dim_b-1
im_b = it_b/shape_b(i)
id_b = id_b + stride_b(i)*(it_b-im_b*shape_b(i))
it_b = im_b
end do
!$acc loop vector private(it_a,id_l,is_l)
do ia=0,vol_a-1
it_a=ia; id_l=id_b; is_l=0
!$acc loop seq
do i=0,dim_a-1
im_a = it_a/shape_a(i)
ic = it_a - im_a*shape_a(i)
id_l = id_l + stride_a_g(i)*ic
is_l = is_l + stride_a_l(i)*ic
it_a = im_a
end do
id(ia) = id_l
is(ia) = is_l
end do
!$acc loop vector
do ia=0,vol_a-1
s(ia)=data_in(ia+ib0)
end do
!$acc loop vector
do ia=0,vol_a-1
data_out(id(ia)) = s(is(ia))
end do
end do
!$acc end parallel
Basically I pre-calculate all the indices in the write loop. Note that I also have a version which does index calculation inside the write loop (so there is only one shared-memory array in the kernel), which works well on GPU. I ran both version on CPU and they generate exactly the same index sequence, so the logic should be correct. However, when I run this version with pre-calculated indices I would get
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution
If I use cuda-memcheck I would find something like this:
========= Invalid __shared__ read of size 8
========= at 0x000008b0 in f_tt_86_gpu
========= by thread (31,0,0) in block (111,0,0)
========= Address 0x5f952c18 is out of bounds
The OFB is due to invalid shared memory access in the write loop. I tend to suspect that GPU code generation fails when multiple shared-memory arrays are involved.
Without pre-calculating the indices, the performance can be 2X worse, so I would love to see a solution.
Thank you!