CUDA_EXCEPTION_10, Device Illegal Address

I don’t know why my job produces the following error messages under the CUDA debugger.
When I change the declaration of the variable “b” as follows, the CUDA debugger no longer reports any errors:

integer(kind=4),dimension(L) :: b -> integer(kind=4),dimension(256) :: b

Could you let me know whether the following source code uses CUDA Fortran incorrectly?

(cuda-gdb) run
Starting program: /home/kouchi/test/run/…/bin/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffe8c60700 (LWP 19024)]

CUDA Exception: Device Illegal Address
The exception was triggered in device 0.

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (83,0,0), thread (0,0,0), device 0, sm 7, warp 26, lane 0]
0x0000000002ad52c0 in cudamod::cuda_kernel<<<(256,1,1),(128,1,1)>>> () at test.cuf:18
18 do i = 1,L
(cuda-gdb)

------------------------------------- test.cuf -----------------------------------
module cudamod
  use cudafor
  implicit none

contains
  attributes(global) subroutine cuda_kernel(L,N,M,a)
    integer(kind=4),value :: L,N,M
    integer(kind=4),dimension(L,N,M),intent(in),device :: a

    ! local variables
    integer(kind=4) :: i,tid,bid
    ! local array
    integer(kind=4),dimension(L) :: b

    bid = blockidx%x
    tid = threadidx%x

    do i = 1,L
      b(i) = a(i,bid,tid)
    end do

    do i = 1,L
      a(i,bid,tid) = b(i) + 1
    end do

  end subroutine cuda_kernel
end module cudamod

--------------------------------- main.cuf ---------------------------------------------------
program test
  use cudamod
  implicit none
  integer(kind=4),parameter :: L = 256, N = 256, M = 128
  integer(kind=4),dimension(L,N,M) :: a
  integer(kind=4),dimension(L,N,M),device :: d_a
  integer(kind=4) :: istat

  a = 0
  d_a = a
  call cuda_kernel<<<N,M>>>(L,N,L,d_a)
  istat = cudaDeviceSynchronize()
end program test

Hi KOUCHI_Hiroyuki,

There are a few problems here. First, you have a typo in your call to cuda_kernel: you’re passing “L” instead of “M” as the third argument.

The larger issue is that, by default, the device heap is only 8MB. Automatic arrays are implicitly allocated from this heap, so every thread allocates its own copy of “b”: 256 blocks x 128 threads x 256 elements x 4 bytes comes to 32MB of heap in total. To increase the heap size, you can call “cudaDeviceSetLimit” as shown below.

A third problem is that 32MB is the max heap size. Since not all of this space is available to users, “L” can be at most 236 (or you can reduce the total number of threads).

Note that using automatics in device kernels is supported but not recommended. Besides the heap size limits, your performance will be affected, since memory allocation from device code is expensive.
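One way to avoid the automatic array altogether is to give “b” a size that is known at compile time, which is what the dimension(256) change in the question does. The following is only a rough sketch under that assumption, not a tested replacement: the names cudamod_fixed, cuda_kernel_fixed, test_fixed and the bound LMAX are hypothetical and do not appear in the original code. A fixed-size local array should not need an implicit allocation from the device heap, so the heap limit should not come into play.

```fortran
module cudamod_fixed
  use cudafor
  implicit none

  ! Hypothetical compile-time upper bound on L (an assumption of this sketch)
  integer(kind=4), parameter :: LMAX = 256

contains
  attributes(global) subroutine cuda_kernel_fixed(L,N,M,a)
    integer(kind=4), value :: L,N,M
    integer(kind=4), dimension(L,N,M), device :: a

    integer(kind=4) :: i,tid,bid
    ! Fixed-size local array: the size is known at compile time,
    ! so it is not an automatic array
    integer(kind=4), dimension(LMAX) :: b

    bid = blockidx%x
    tid = threadidx%x

    ! Guard against indices outside a and against L exceeding the bound
    if (L <= LMAX .and. bid <= N .and. tid <= M) then
      do i = 1,L
        b(i) = a(i,bid,tid)
      end do
      do i = 1,L
        a(i,bid,tid) = b(i) + 1
      end do
    end if
  end subroutine cuda_kernel_fixed
end module cudamod_fixed

program test_fixed
  use cudafor
  use cudamod_fixed
  implicit none
  integer(kind=4), parameter :: L = 256, N = 256, M = 128
  integer(kind=4), dimension(L,N,M) :: a
  integer(kind=4), dimension(L,N,M), device :: d_a
  integer(kind=4) :: istat

  a = 0
  d_a = a
  call cuda_kernel_fixed<<<N,M>>>(L,N,M,d_a)
  istat = cudaDeviceSynchronize()

  ! Copy the result back; every element started at 0 and is incremented once
  a = d_a
  print *, "all elements equal 1: ", all(a == 1)
end program test_fixed
```

The trade-off is that LMAX has to cover the largest L you intend to pass; the guard in the kernel simply skips the work if it does not.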

Hope this helps,
Mat


module cudamod
  use cudafor
  implicit none

contains
  attributes(global) subroutine cuda_kernel(L,N,M,a)
    integer(kind=4),value :: L,N,M
    ! intent(in) dropped because the kernel assigns to a
!   integer(kind=4),dimension(L,N,M),intent(in),device :: a
    integer(kind=4),dimension(L,N,M),device :: a

    ! local variables
    integer(kind=4) :: i,tid,bid
    ! local array (automatic, so it is allocated from the device heap)
    integer(kind=4),dimension(L) :: b

    bid = blockidx%x
    tid = threadidx%x

    ! only touch a when the block and thread indices are within its bounds
    if (bid .le. N .and. tid .le. M) then
      do i = 1,L
        b(i) = a(i,bid,tid)
      end do

      do i = 1,L
        a(i,bid,tid) = b(i) + 1
      end do
    else
      print *, "Error: out-of-bounds: ", bid, tid
    end if

  end subroutine cuda_kernel
end module cudamod

program test
  use cudamod
  implicit none
  integer(kind=4),parameter :: L = 236, N = 256, M = 128
  integer(kind=4),dimension(L,N,M) :: a
  integer(kind=4),dimension(L,N,M),device :: d_a
  integer(kind=4) :: istat
  integer(kind=cuda_count_kind) :: val
  integer :: rc

  ! query the current device heap limit
  rc= cudaDeviceGetLimit( val, cudaLimitMallocHeapSize )
  print *, "Max Heap Size: ", val
  ! grow the limit by the bytes needed for every thread's copy of b
  val = (L*M*N*4)+val
  rc= cudaDeviceSetLimit( cudaLimitMallocHeapSize, val )
  ! read the limit back to see what was actually set
  rc= cudaDeviceGetLimit( val, cudaLimitMallocHeapSize )
  print *, "New Heap Size: ", val
  a = 0
  d_a = a
  call cuda_kernel<<<N,M>>>(L,N,M,d_a)
  istat = cudaDeviceSynchronize()
end program test
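
The program above stores the return codes in “rc” and “istat” but never inspects them, so a rejected heap-size request or a failed kernel would go unnoticed outside the debugger. As a minimal sketch of how the codes could be checked: the module cuda_check_mod, the subroutine check and the program test_checked are hypothetical names introduced here for illustration, and the 32MB request is just the figure discussed above.

```fortran
module cuda_check_mod
  use cudafor
  implicit none
contains
  ! Hypothetical helper (not part of the original code): stop with a
  ! readable message when a CUDA runtime call reports an error
  subroutine check(rc, what)
    integer, intent(in) :: rc
    character(len=*), intent(in) :: what
    if (rc /= cudaSuccess) then
      print *, what, " failed: ", trim(cudaGetErrorString(rc))
      stop 1
    end if
  end subroutine check
end module cuda_check_mod

program test_checked
  use cudafor
  use cuda_check_mod
  implicit none
  integer(kind=cuda_count_kind) :: val

  call check(cudaDeviceGetLimit(val, cudaLimitMallocHeapSize), &
             "cudaDeviceGetLimit")
  print *, "Current heap size: ", val

  ! Request a 32MB heap; check() stops the program if the runtime refuses
  val = int(32*1024*1024, kind=cuda_count_kind)
  call check(cudaDeviceSetLimit(cudaLimitMallocHeapSize, val), &
             "cudaDeviceSetLimit")

  ! Read the limit back to see what was actually set
  call check(cudaDeviceGetLimit(val, cudaLimitMallocHeapSize), &
             "cudaDeviceGetLimit")
  print *, "Heap size now: ", val

  ! After a kernel launch, the same helper could wrap:
  !   call check(cudaGetLastError(), "kernel launch")
  !   call check(cudaDeviceSynchronize(), "kernel execution")
end program test_checked
```

Checking cudaDeviceSynchronize in particular matters because kernel failures such as an illegal address are reported asynchronously, after the launch itself has returned.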

Dear Mat-san,

Thank you for the information.

Sincerely yours,