Issue with 2D array copy back to host

I’m facing an issue with a CUDA Fortran test program involving two-dimensional arrays. The copy of a 2D array from device to host is not working. Please check the program below and let me know what I’m doing wrong.


$ pgf90 -Mcuda multi_dim.cuf -o multi_dim.exe

$ ./multi_dim.exe
Data host to device
kernel finish
0: copyout Memcpy (host=0x675b80, dev=0x700200000, size=32) FAILED: 4(unspecified launch failure)
$


attributes(global) subroutine mdimen_kernel(m,n,a)

implicit none
integer i,j
integer, value :: m,n

real :: a(:,:)

i = ( blockIdx %x -1)* blockDim %x + threadIdx %x
! 2 blocks & 4 threads
a( blockIdx %x , threadIdx %x) = i

end subroutine


program testmdim

use cudafor
implicit none

integer m,n,i,j,istat
parameter (m = 2, n = 4)
! Separately two arrays are declared. aout is the array to 
! hold back the values from kernel
real :: a(m,n), aout(m,n)
real, device :: a_d(m,n)

! Initialize host array
do i = 1,m
do j = 1,n
a(i,j) = i+j
enddo
enddo

! Data copy from host to device array

a_d = a
write(*,*) 'Data host to device'

call mdimen_kernel<<<2,4>>> (m,n,a_d)
istat=cudaThreadSynchronize()
write(*,*) 'kernel finish'

! Data copy back to host from kernel
aout = a_d
write(*,*) 'Data device to host'

! Printing the kernel output
do i=1,m
do j=1,n
	write(*,*) 'aout( ',i,' ',j,' )', aout(i,j)
enddo
enddo

end program

Hi SanBc,

Typically these errors mean that your kernel crashed for some reason. Unless you specifically check for errors after the kernel returns, the error will show up the next time the device is used, which here is the copy back to the host.

The reason for the kernel crash is that you have not provided an explicit interface to the kernel. This is a requirement when passing F90 assumed-shape arrays. Otherwise, Fortran defaults to F77 calling conventions and only passes a pointer to the array, without the shape information that a(:,:) needs.

The easiest fix is to put your kernel in a module.

module foo
contains
  attributes(global) subroutine mdimen_kernel(m,n,a)

    implicit none
    integer i,j
    integer, value :: m,n

    real :: a(:,:)

    ! Guard against indices outside the bounds of a(m,n)
    if (blockIdx%x .le. m .and. threadIdx%x .le. n) then
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      ! 2 blocks & 4 threads
      a(blockIdx%x, threadIdx%x) = i
    endif

  end subroutine
end module foo

program testmdim

  use cudafor
  use foo
  implicit none

  integer m,n,i,j,istat
  parameter (m = 2, n = 4)
  ! Separately two arrays are declared. aout is the array to
  ! hold back the values from kernel
  real :: a(m,n), aout(m,n)
  real, device :: a_d(m,n)

  ! Initialize host array
  do i = 1,m
    do j = 1,n
      a(i,j) = i+j
    enddo
  enddo

  ! Data copy from host to device array

  a_d = a
  write(*,*) 'Data host to device'

  call mdimen_kernel<<<2,4>>> (m,n,a_d)
  istat=cudaThreadSynchronize()
  write(*,*) 'kernel finish'

  ! Data copy back to host from kernel
  aout = a_d
  write(*,*) 'Data device to host'

  ! Printing the kernel output
  do i=1,m
    do j=1,n
      write(*,*) 'aout( ',i,' ',j,' )', aout(i,j)
    enddo
  enddo

end program

Hope this helps,
Mat

Hi Mat,

Thanks for the solution

Can we define multiple kernels in a single module? Or do we have to include each kernel in a separate module?

Can you give some details about how to debug CUDA Fortran programs? The errors don’t show where exactly the program has failed. Are there any special CUDA debuggers available?

Thanks
Sangamesh

Can we define multiple kernels in a single module? Or do we have to include each kernel in a separate module?

You can define multiple kernels in a single module.
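
As a rough sketch of what that could look like (the module name kernels_mod and the two kernels below are made up for illustration and are not from your program):

module kernels_mod
contains
  ! Hypothetical kernel 1: scale every element of a by s
  attributes(global) subroutine scale_kernel(n, a, s)
    implicit none
    integer, value :: n
    real, value :: s
    real :: a(:)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i .le. n) a(i) = s * a(i)
  end subroutine

  ! Hypothetical kernel 2: add b to a element by element
  attributes(global) subroutine add_kernel(n, a, b)
    implicit none
    integer, value :: n
    real :: a(:), b(:)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i .le. n) a(i) = a(i) + b(i)
  end subroutine
end module kernels_mod

A single "use kernels_mod" in the host program then makes the interfaces of both kernels visible to the caller.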

Can you give some details about how to debug CUDA Fortran programs? The errors don’t show where exactly the program has failed.

Defensive programming, especially during development, is a good thing. Hence, I recommend adding error checking after each kernel launch. Something like:

call foo <<<N,M>>>(A,B,C)
istat = cudaGetLastError()
if (istat .ne. 0 ) then
   print *, "Error in Kernel foo: ", cudaGetErrorString(istat)
   stop
endif
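
Keep in mind that a kernel launch is asynchronous, so a crash inside the kernel (as opposed to a bad launch configuration) may not be reported until the host synchronizes with the device. A slightly fuller sketch, using the same placeholder names as above and assuming your CUDA version provides cudaDeviceSynchronize, checks at both points:

call foo <<<N,M>>>(A,B,C)
! Errors detected at launch time, e.g. an invalid configuration
istat = cudaGetLastError()
if (istat .ne. cudaSuccess) then
   print *, "Launch error in Kernel foo: ", cudaGetErrorString(istat)
   stop
endif
! Wait for the kernel to finish; errors raised during execution surface here
istat = cudaDeviceSynchronize()
if (istat .ne. cudaSuccess) then
   print *, "Execution error in Kernel foo: ", cudaGetErrorString(istat)
   stop
endif

The synchronization costs some performance, so you may want to keep the second check only in debug builds.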



Are there any special cuda debuggers available?

Yes.
Allinea: http://www.allinea.com/
CUDA 6.5 Toolkit: https://developer.nvidia.com/cuda-toolkit
PGDBG (in emulation mode)

Hope this helps,
Mat