Automatic kernel producing out of bounds reads

The following generates a stream of out of bounds global reads according to cuda-memcheck when compiled with pgf95 versions 12.5 and 13.3. It works with m=128 and/or using mod instead of iand.

Looking at the .gpu files, the combination of m=256 and iand instead of mod produces a very different kernel with an extra argument, compared with the other three combinations.

implicit none
integer, parameter :: m=256 ! try 128 instead
integer, parameter :: n=32
real8, device, dimension(0:m-1,n,n) :: f
real
8, dimension(0:m-1,n,n) :: fhost
integer shift,i,ip
shift=m
f = 1d0
!$cuf kernel do(1) <<< (), () >>>
do i=0,m-1
ip = iand(i+shift,m-1)
! ip = mod(i+shift,m) ! try this instead
f(ip,n,n) = 0.5d0f(ip,n,n)
enddo
fhost = f
write (6,
) "returned value ",sum(fhost)
end

Hi Paul,

Thanks for the report. I added a problem report (TPR#19339) and sent it on to engineer. It’s interesting that the error only occur at 256. Change this to 255 or 257, then it’s fine. Also, the code runs to completion if I use OpenACC instead, but it doesn’t look like it correct answers. Again, it’s fine when m is not 256. Interesting case.

  • Mat

Dear Mat,

Thanks for confirming, and for filing a problem report. I’m not very familiar with OpenACC, but according to -Minfo the 13.3 compiler generates a scalar kernel because it can’t establish that the loop iterations are independent.

Looking at the .gpu files, launch_bounds is set to 1 in the kernel generated using OpenACC, but to 128 in the parallel kernel generated by a !$cuf directive. That may be why my example works with OpenACC, but presumably doesn’t run very fast.

Paul

Thanks for confirming, and for filing a problem report.

You’re welcome.

OpenACC, but according to -Minfo the 13.3 compiler generates a scalar kernel because it can’t establish that the loop iterations are independent.

Yes, however you can add the “independent” clause to tell the compiler that it is independent.

For example:

% cat out.f90
implicit none
integer, parameter :: m=256 ! try 128 instead
integer, parameter :: n=32
#ifdef _CUDA
real*8, device, dimension(0:m-1,n,n) :: f
#else
real*8, dimension(0:m-1,n,n) :: f
#endif
real*8, dimension(0:m-1,n,n) :: fhost
integer shift,i,ip,ierr
shift=m
f = 1d0
#ifdef _OPENACC
!$acc kernels loop independent
#else
!$cuf kernel do(1) <<< (*), (*) >>>
#endif
do i=0,m-1
ip = iand(i+shift,m-1)
! ip = mod(i+shift,m) ! try this instead
f(ip,n,n) = 0.5d0*f(ip,n,n)
enddo
fhost = f
write (6,*) "returned value ",sum(fhost)
end
% pgf90 out.f90 -Mpreprocess -Minfo -Mcuda -acc ; a.out
MAIN:
18, Loop is parallelizable
Accelerator kernel generated
18, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
24, sum reduction inlined
returned value 262144.0000000000
  • Mat

TPR 19339 - CUF: user example code gets runtime error when using “ishift”

has been fixed in the now available 13.6 release.

regards,
dave