Inconsistent runtime error in device-to-host copy


I have recently run into a strange bug in the program I am working on. The error occurs at a device-to-host array assignment, but it does not happen every time I run the program. The code for the relevant subroutine is below:

subroutine CubicInterpVec3D(coords, result)

		real(real_kind), dimension(:,:) :: coords
		real(real_kind), dimension(:) :: result

		integer :: nCoords, dimGrid, dimBlock
		real(real_kind), device, allocatable, dimension(:,:) :: coordsDev
		real(real_kind), device, allocatable, dimension(:) :: resultDev

		if(allocFlag==1) then

			! One interpolated value is expected per coordinate row
			nCoords = size(result)
			if(size(coords,1) .ne. nCoords) then
				print *, 'Number of coordinates is not equal to the number of desired interpolated values!'
				stop 'Program terminated by cubic_bspline_interp_3D_mod:CubicInterpVec3D'
			end if

			print *, 'Attempting to allocate device memory...'
			allocate( coordsDev(nCoords, 3), resultDev(nCoords) )
			print *, 'Attempting to copy test points to device...'
			coordsDev = coords(1:nCoords, 1:3)

			! Enough blocks of dimBlock threads to cover all nCoords points
			print *, 'Attempting to call the kernel...'
			dimBlock = 16
			dimGrid = max(1,nCoords/dimBlock+1)
			call CubicInterpVec3D_kernel<<<dimGrid,dimBlock>>>(coordsDev,resultDev,nCoords)

			! Device-to-host copy
			print *, 'Attempting to copy results back to host...'
			result = resultDev
			!istat = cudaMemcpy(result,resultDev,nCoords)

			print *, 'Deallocating device memory...'
			deallocate( coordsDev, resultDev )

		else

			print *, 'Coefficient matrix not allocated on device yet!'
			stop 'Program terminated by cubic_bspline_interp_3D_mod:CubicInterpVec3D'

		end if

end subroutine CubicInterpVec3D

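The error occurs at the device-to-host copy of resultDev into result, immediately after the 'Attempting to copy results back to host...' message is printed.
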
As you can see I have also tried using cudaMemcpy, but the same intermittent error shows up. The error is:

copyout Memcpy (host=0x16edf00, dev=0x1f94b00, size=200) FAILED:4

I am running a 9800GT on 64-bit Ubuntu Linux. Any help would be appreciated. I can post the full code if anyone needs it, but it doesn’t seem to be relevant as the error shows up at the very end of the program, after all the kernel calls and other stuff.

Hi Joe,

My best guess is that your kernel is failing. Try adding the following code just after the call to your kernel to see if I’m correct.

! Check for errors
    errCode = cudaGetLastError()
    if (errCode .gt. 0) then
       print *, cudaGetErrorString(errCode)
    endif

- Mat

No errors are reported. Also, I tried running the program in emulator mode and it causes a segmentation fault at exactly that point every time (so the inconsistency issue is not there in emulator mode).
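One possible reason a check placed right after the launch stays silent: kernel launches are asynchronous, so cudaGetLastError() called immediately afterwards mainly reports problems with the launch itself. A fault that happens while the kernel is running tends to surface at the next synchronizing call, which is often the following device-to-host copy. Below is a minimal sketch of a check that synchronizes before reporting. The names check_demo_mod, fill_kernel and check_kernel are made up for illustration and are not from the original program. cudaDeviceSynchronize() is the call in current toolkits; toolkits from the era of this thread used cudaThreadSynchronize() instead.

```fortran
module check_demo_mod
  use cudafor
  implicit none
contains

  ! Hypothetical stand-in kernel: each thread writes its own index into y.
  attributes(global) subroutine fill_kernel(y, n)
    integer, value :: n
    real :: y(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = real(i)
  end subroutine fill_kernel

  ! Hypothetical helper: reports launch errors first, then execution errors.
  subroutine check_kernel(label)
    character(len=*), intent(in) :: label
    integer :: errCode

    ! Errors detected at launch time (for example a bad configuration)
    errCode = cudaGetLastError()
    if (errCode /= cudaSuccess) then
      print *, label, ': launch failed: ', cudaGetErrorString(errCode)
      stop 1
    end if

    ! Wait for the kernel to finish; execution faults are returned here
    errCode = cudaDeviceSynchronize()
    if (errCode /= cudaSuccess) then
      print *, label, ': execution failed: ', cudaGetErrorString(errCode)
      stop 1
    end if
  end subroutine check_kernel

end module check_demo_mod

program check_demo
  use cudafor
  use check_demo_mod
  implicit none
  integer, parameter :: n = 50
  real :: y(n)
  real, device, allocatable :: yDev(:)

  allocate(yDev(n))
  call fill_kernel<<<(n + 15) / 16, 16>>>(yDev, n)
  call check_kernel('fill_kernel')
  y = yDev            ! copy back once the check has passed
  deallocate(yDev)
  print *, 'last value: ', y(n)
end program check_demo
```

Even with the synchronization, an out-of-bounds write that happens to land in memory the device considers valid may not be reported at all, so a clean result from a check like this does not prove the kernel is correct.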

In case anyone cares, it was a stupid typo in the kernel after all that caused a write outside a device array’s bounds. It didn’t trigger a kernel error though. It’s odd that this caused a segfault consistently in emulation mode but only once in a while in GPU mode.
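The typo itself is not shown in the thread, so the following is not the author's fix. It is only a sketch of one common guard against out-of-bounds writes with a launch configuration like the one in the original post, where dimGrid = nCoords/dimBlock+1 launches more threads than there are elements. With a hypothetical nCoords of 50 and dimBlock of 16, that formula gives 4 blocks and 64 threads, so 14 threads have an index past the end of the arrays and must not touch them. The names guard_demo_mod and scale_kernel are made up for illustration.

```fortran
module guard_demo_mod
  use cudafor
  implicit none
contains

  ! Hypothetical kernel: y(i) = 2*x(i) for i = 1..n.
  attributes(global) subroutine scale_kernel(x, y, n)
    integer, value :: n
    real :: x(n), y(n)
    integer :: i

    ! Global 1-based index of this thread
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

    ! Surplus threads in the last block skip the write
    if (i <= n) y(i) = 2.0 * x(i)
  end subroutine scale_kernel

end module guard_demo_mod

program guard_demo
  use cudafor
  use guard_demo_mod
  implicit none
  integer, parameter :: n = 50
  real :: x(n), y(n)
  real, device, allocatable :: xDev(:), yDev(:)
  integer :: dimBlock, dimGrid

  x = 1.0
  allocate(xDev(n), yDev(n))
  xDev = x

  ! Ceiling division: smallest number of blocks that covers n elements
  dimBlock = 16
  dimGrid = (n + dimBlock - 1) / dimBlock

  call scale_kernel<<<dimGrid, dimBlock>>>(xDev, yDev, n)
  y = yDev
  deallocate(xDev, yDev)

  print *, 'all values doubled: ', all(y == 2.0)
end program guard_demo
```

The ceiling division avoids launching an entirely empty block when n is an exact multiple of dimBlock, but the index guard in the kernel is still what keeps the surplus threads from writing past the end of the array.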