I am developing a multi-dimension transpose kernel in OpenACC. The relevant section looks like this:
real*8, dimension(0:tile_size-1) :: s integer, dimension(0:tile_size-1) :: is, id integer :: ia,ib,id_b,it_b,it_a,im_b,im_a,ic,i,ib0,id_l,is_l !$acc parallel present(data_in,data_out,shape_a,shape_b,stride_a_l,stride_a_g,stride_b) !$acc loop independent gang private(s,is,id,ia,it_b,im_b,id_b,ib0) do ib=0,vol_b-1 !$acc cache(s,is,id) ib0 = ib*vol_a it_b=ib; id_b=0 !$acc loop seq do i=0,dim_b-1 im_b = it_b/shape_b(i) id_b = id_b + stride_b(i)*(it_b-im_b*shape_b(i)) it_b = im_b end do !$acc loop vector private(it_a,id_l,is_l) do ia=0,vol_a-1 it_a=ia; id_l=id_b; is_l=0 !$acc loop seq do i=0,dim_a-1 im_a = it_a/shape_a(i) ic = it_a - im_a*shape_a(i) id_l = id_l + stride_a_g(i)*ic is_l = is_l + stride_a_l(i)*ic it_a = im_a end do id(ia) = id_l is(ia) = is_l end do !$acc loop vector do ia=0,vol_a-1 s(ia)=data_in(ia+ib0) end do !$acc loop vector do ia=0,vol_a-1 data_out(id(ia)) = s(is(ia)) end do end do !$acc end parallel
Basically I pre-calculate all the indices in the write loop. Note that I also have a version which does index calculation inside the write loop (so there is only one shared-memory array in the kernel), which works well on GPU. I ran both version on CPU and they generate exactly the same index sequence, so the logic should be correct. However, when I run this version with pre-calculated indices I would get
Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution Failing in Thread:1 call to cuMemFreeHost returned error 700: Illegal address during kernel execution
If I use cuda-memcheck I would find something like this:
========= Invalid __shared__ read of size 8 ========= at 0x000008b0 in f_tt_86_gpu ========= by thread (31,0,0) in block (111,0,0) ========= Address 0x5f952c18 is out of bounds
The OFB is due to invalid shared memory access in the write loop. I tend to suspect that GPU code generation fails when multiple shared-memory arrays are involved.
Without pre-calculating the indices, the performance can be 2X worse, so I would love to see a solution.