Hello all,
I’m just starting to use CUDA Fortran and am running some simple examples, and I have a problem with constant memory.
This is the code of the main program:
program main
  use kernel
  use cudafor
  implicit real*8 (a-h,o-z)
  type(dim3) N_blocks, N_threads
  integer, parameter :: N=128
  real*8, dimension(N, N, N) :: res
  real*8, dimension(N, N, N), device :: resD
  real*8, device :: x
  theta=4   ! constant array from module kernel; both elements are set to 4
  x=1.d0
  N_blocks=dim3(N,N,1)
  N_threads=dim3(N,1,1)
  ! launch with both the grid and the block dimensions
  call kernel<<<N_blocks, N_threads>>>(x, resD, N)
  res=resD
  sum=0.d0
  do i=1,N
    do j=1,N
      do k=1,N
        sum=sum+res(i,j,k)
      end do
    end do
  end do
  print*,'sum = ', sum
end program main
And this is my kernel:
module kernel
  integer, constant :: i1=2, i2=2
  real*8, constant :: pi=acos(-1.d0)
  real*8, dimension(2), constant :: theta
contains
  attributes(global) subroutine kernel(x, resD, N)
    use cudafor
    implicit none
    integer, value :: N
    real*8, device :: x
    real*8, dimension (N, N, N), device :: resD
    integer i,j,k
    ! one block per (i,j) pair, one thread per k
    k=threadIdx%x
    i=blockIdx%x
    j=blockIdx%y
    ! resD(i,j,k)=i1*x*theta(1) ! this works
    resD(i,j,k)=i2*x*theta(1) ! but this does not work!
  end subroutine kernel
end module kernel
The problem is that this kernel works when I use the constant i1, but does not work with the constant i2
(error - “0: copyout Memcpy (host=0x6ade60, dev=0xf200000000, size=16777216) FAILED: 4(unspecified launch failure)”).
But there is no difference between i1 and i2 at all, except one: I declared i1 first and then i2. If I declare i2 first and then i1, the code works for i2 and does not work for i1…
I guess there is some problem with access to constant memory on the device, but I don’t understand what exactly the problem is. How should I use constant memory in order to avoid such errors?
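Since kernel launches are asynchronous, the error above is only reported at the later copy from the device. To narrow down where it really comes from, the launch could be checked directly. Below is a minimal stand-alone sketch of such a check, not my original program: the module check_mod, the kernel fill and the array names are hypothetical, and it assumes a toolkit recent enough to provide cudaDeviceSynchronize in the cudafor module (older releases may need cudaThreadSynchronize instead).

```fortran
! Hypothetical stand-alone example: checking a kernel launch for errors.
module check_mod
  use cudafor
contains
  ! Trivial kernel (illustration only): sets every element of a to 1.
  attributes(global) subroutine fill(a, n)
    implicit none
    integer, value :: n
    real(8), device :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = 1.d0
  end subroutine fill
end module check_mod

program check_launch
  use cudafor
  use check_mod
  implicit none
  integer, parameter :: n = 1024
  real(8), device :: aD(n)
  real(8) :: a(n)
  integer :: istat

  call fill<<<(n + 127) / 128, 128>>>(aD, n)

  ! Errors in the launch itself (e.g. a bad configuration) show up here.
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'launch: ', trim(cudaGetErrorString(istat))

  ! Errors raised while the kernel is running show up after synchronizing.
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, 'kernel: ', trim(cudaGetErrorString(istat))

  a = aD
  print *, 'sum = ', sum(a)
end program check_launch
```

Putting the same two checks right after the kernel call in my main program should at least show whether the failure happens at launch or during execution, rather than at the res=resD copy.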