Consider the following kernel:
attributes(global) subroutine reorder(ncs, N1, N2, jmax, cout, cin)
  integer, value :: ncs, N1, N2, jmax
  real, device :: cout(N1,N2), cin(N1,N2)
  integer :: tid, bn, indx
  integer :: jnew, iout
  tid = threadidx%x
  bn = (blockidx%y-1) * griddim%x + blockidx%x-1
  indx = bn*blockdim%x + tid
  iout = jreorder(indx)
  if (iout .ge. 1 .and. iout .le. N1) then
    do jnew = 1, jmax
      cout(iout,jnew) = cin(indx,jnew)
    end do
  end if
end subroutine reorder
where jreorder is a permutation vector (i.e., guaranteed to have no duplicate entries) that is declared in the enclosing module. The kernel is called thus:
call reorder<<<gsize,bsize>>>(ncs, NCELL, MXGSAER, ischang(ncs), cx, cnew)
where, as it happens, cx and cnew are also declared in the enclosing module with shape (NCELL, MXGSAER), either statically declared or allocated at run time (I’ve tried both). Also, the 4th argument, ischang(ncs), is verified to be less than MXGSAER, so the loop over jnew is guaranteed not to exceed the second array dimension.
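To make the assumptions behind the launch explicit, here is a host-side sketch in plain Fortran (no device code) that emulates the kernel's index formula and checks the three conditions described above. Everything in it is hypothetical: the sizes gx, gy, bx stand in for griddim%x, griddim%y and blockdim%x, the reversal permutation stands in for the real jreorder, and jmax = 2 stands in for ischang(ncs). It also assumes the grid covers NCELL exactly, which may not hold in the real code.

```fortran
program check_launch_assumptions
  implicit none
  ! All sizes and data below are made up for illustration.
  integer, parameter :: gx = 3, gy = 2, bx = 4   ! stand-ins for griddim%x, griddim%y, blockdim%x
  integer, parameter :: ncell = gx*gy*bx         ! assumes the grid covers the cells exactly
  integer, parameter :: mxgsaer = 3
  integer :: jreorder(ncell), hits(ncell), seen(ncell)
  integer :: bix, biy, tid, bn, indx, i, jmax

  ! A simple permutation of 1..ncell (reversal), standing in for the real jreorder.
  jreorder = [(ncell + 1 - i, i = 1, ncell)]
  jmax = 2                                       ! stand-in for ischang(ncs)

  ! 1. Emulate the kernel's index formula for every block and thread.
  hits = 0
  do biy = 1, gy
    do bix = 1, gx
      do tid = 1, bx
        bn   = (biy-1)*gx + bix-1
        indx = bn*bx + tid
        hits(indx) = hits(indx) + 1
      end do
    end do
  end do
  print *, 'every index 1..ncell produced exactly once: ', all(hits == 1)

  ! 2. Check that jreorder is a permutation of 1..ncell.
  seen = 0
  do i = 1, ncell
    if (jreorder(i) >= 1 .and. jreorder(i) <= ncell) seen(jreorder(i)) = seen(jreorder(i)) + 1
  end do
  print *, 'jreorder is a permutation of 1..ncell: ', all(seen == 1)

  ! 3. Check the loop bound against the second array dimension.
  print *, 'jmax within second dimension: ', jmax <= mxgsaer
end program check_launch_assumptions
```

If the real gsize and bsize produce more threads than NCELL, the first check is the one to adapt, since any indx above NCELL would read past the end of jreorder before the iout test is reached.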
So far as I can tell, this should work. However, when I run the code as described, the results are incorrect, and in fact they change when I switch from statically declared to run-time allocated arrays (which suggests that a faulty memory access is to blame). If, on the other hand, I change the kernel to this:
attributes(global) subroutine reordercx(ncs, jmax)
  integer, value :: ncs, jmax
  integer :: tid, bn, indx
  integer :: jnew, iout
  tid = threadidx%x
  bn = (blockidx%y-1) * griddim%x + blockidx%x-1
  indx = bn*blockdim%x + tid
  iout = jreorder(indx)
  do jnew = 1, jmax
    cx(iout,jnew) = cnew(indx,jnew)
  end do
end subroutine reordercx
taking advantage of the fact that cx and cnew are available as module variables, the answers are correct for both the declared and allocated versions.
Is this a compiler bug, or am I overlooking some error I committed in the first version of the kernel?
-robert.