unspecified launch failure

Hi,

I got an error message during a CUDA Fortran program run:
"
0: copyin Symbol Memcpy (dev=0x604e60, host=0x60ce30, size=4, offset=12132) FAILED: 4
"

Does this mean there is a limit on how many variables I can copy from host to device?

System information:
OS: Windows XP SP3
CPU: Intel core i7-920
GPU: Geforce GTX 460
Compiler: PGI Accelerated Visual Fortran 11.1
CUDA Toolkit version: 3.2

The failed subroutine:

subroutine init_dev
use All_vars
use dev_vars
implicit none

! Main Parameters:
dte_dev = dte
Ramp_dev = Ramp
HORCON_dev = HORCON
HPRNU_dev = HPRNU
VPRNU_dev = VPRNU
UMOL_dev = UMOL
BFRIC_dev = BFRIC
Z0B_dev = Z0B
z0=z0b
CBCMIN = BFRIC

temp=(ZZ(KBM1)-Z(KB))/z0
zbz0_dev=temp
Print*,"Debug ---- 1"
z0_dev=z0b_dev
CBCMIN_dev=BFRIC_dev
Print*,"Debug ---- 2"

! Basic Extents:
nt_dev = nt
ns_dev = ns
nn_dev = nn
NTT_dev = NTT
Print*,"Debug ---- 3"
KB_dev = KB
KBM1_dev = KBM1
KBM2_dev = KBM2
numebc_dev = numebc
numqbc_dev = numqbc
numfbc_dev = numfbc
Print*,"Debug ---- 4"

… …

All “*_dev” variables are 4-byte “real, constant” or “integer, constant” variables.

The screen displays:
"
Debug ---- 1
Debug ---- 2
0: copyin Symbol Memcpy (dev=0x604e60, host=0x60ce30, size=4, offset=12132) FAILED: 4
"

Has anybody seen this problem? Thank you!


Bingray

Hi Bingray,

It’s your device-to-device copies that are causing the problem. We are working on adding support for device-to-device copies, but it’s not available yet.

Try changing:

z0_dev=z0b_dev
CBCMIN_dev=BFRIC_dev

to

z0_dev=Z0B  
CBCMIN_dev=BFRIC

Hope this helps,
Mat

Hi Mat, is the documentation ahead of the implementation? From the new CUDA Fortran manual, I thought this feature was supposed to be supported.

Tuan

Hi Tuan,

Actually, the CUDA Fortran reference guide was written before the implementation. A few things turned out to be very difficult to implement (such as random_number) and have not been added to a release yet.

Device-to-device transfers between variables in global device memory are allowed. In this case, however, the user is trying to copy device to device between variables in constant memory. I should have been clearer in my response.

I’m not even sure if CUDA C allows constant-to-constant data transfers. If it does, then this is more of a bug on our part, since we obviously missed it. If it doesn’t, then we most likely won’t be able to support it either.

Thanks for questioning me when my answers are not as clear as they should be.

- Mat

Hi Mat,


Thanks a lot. It’s much better now after I changed the code. However, there is still a similar problem even after I removed all device-to-device copies. I have a very long loop in “A3DMain.f” like this:

   Do nstep=1,50000
       THOUR = FLOAT(NSTEP) * DTE / 3600.
       RAMP = TANH(FLOAT(NSTEP)/FLOAT(IRAMP+1))
       call B1_dev
       call B2_dev
       call B3_dev
 ...
       call B10_dev
       call B11_dev
       call B12_dev
   Enddo

Each of the B*_dev subroutines is written in a .cuf file and calls several global kernels.
When nstep is 1, the code runs OK from B1_dev through B12_dev. But when nstep is 2, at the first transfer statement in B1_dev:

	Ramp_dev = Ramp

, a similar error occurs:
“0: copyin Symbol Memcpy (dev=0x60eac0, host=0x616878, size=4, offset=12072) FAILED: 4”

I’ll check the entire code once again. I want to know what else would lead to this problem. Thanks!

Bingray

Hi Bingray,

Sometimes you see these kinds of unexplained errors when a kernel launched before the memcpy has failed. Are you checking the return status of your kernel launches? If not, try adding the following code after each one.

istat = cudaThreadSynchronize()   ! wait for the preceding kernel to finish
errCode = cudaGetLastError()      ! pick up any error the kernel left behind
if (errCode .gt. 0) then
       print *, 'ERROR in B12_dev:', errCode
       stop 'Error! Kernel failed!'
endif
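
To avoid pasting that check after every launch, it could be wrapped in a small helper. This is only a sketch: the module name kernel_check and the subroutine check_kernel are made-up names, not part of the PGI runtime, and it assumes the cudafor module provides cudaSuccess and cudaGetErrorString.

```fortran
module kernel_check
  use cudafor
  implicit none
contains
  ! Hypothetical helper (not a PGI routine): wait for outstanding kernels,
  ! then report any CUDA error, tagged with a caller-supplied label.
  subroutine check_kernel(label)
    character(len=*), intent(in) :: label
    integer :: istat, errCode

    istat = cudaThreadSynchronize()   ! block until the kernel has finished
    errCode = cudaGetLastError()      ! fetch the most recent error code
    if (errCode .ne. cudaSuccess) then
      print *, 'ERROR in ', label, ': ', trim(cudaGetErrorString(errCode))
      stop 'Error! Kernel failed!'
    endif
  end subroutine check_kernel
end module kernel_check
```

With that in place, each call in the time-step loop could be followed by something like call check_kernel('B1_dev'), so the first failing routine is named at the point where it fails rather than at a later memcpy.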

Let me know if this finds anything.

- Mat

Hi Mat,


I detected the failed kernel launch by monitoring the kernel’s running time. Checking the return status is a better way.

The failing code

	ELSEIF (UVM_dev(J)==0.5) THEN
		If(FSR_dev(I_S_dev(1,J))==1.0) Then
			I=I_S_dev(1,J)
		Else
			I=I_S_dev(2,J)
		Endif
		M1 = M_T_dev(1,I)
		M2 = M_T_dev(2,I)
		M3 = M_T_dev(3,I)
		J1 = J_T_dev(1,I)
		J2 = J_T_dev(2,I)
		J3 = J_T_dev(3,I)
		UX_dev(J,K) = ( UA_dev(J3,K)*(YN_dev(M2)-YN_dev(M1))      &
&        + UA_dev(J1,K)*(YN_dev(M3)-YN_dev(M2))                &
&        + UA_dev(J2,K)*(YN_dev(M1)-YN_dev(M3)) ) / ART_dev(I)
	ENDIF

was replaced by

	ELSEIF (UVM_dev(J)==0.5) THEN
		UX_dev(J,K) = 0.0
	ENDIF

and then it’s running fine.

But I can’t see why the original code fails. Under the “release” configuration, the new code runs when compiled with PVF 11.1 and CUDA Toolkit 3.2, but it still fails when compiled with PVF 10.8 and CUDA Toolkit 3.1. Under the “debug” configuration, the program doesn’t run at all and exits immediately without a warning.

I’ll check the return status and look for other errors.

Thanks a lot!


Bingray

Hi Bingray,

My best guess is that you’re getting an access violation when indexing the UA_dev, YN_dev, or ART_dev arrays. What I would do first is compile the code in emulation mode with array bounds checking enabled (-Mcuda=emu -Mbounds). If that doesn’t show anything, I would then break up your “UX_dev=” expression and comment out each line in turn until you can determine which array is causing the fault.
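
A rough sketch of that second step, meant to replace the single “UX_dev=” assignment inside the existing kernel. The temporaries t1, t2 and t3 are made-up names that are not in the original code and would need to be declared with the kernel's other locals.

```fortran
! Sketch only: t1, t2, t3 are assumed local temporaries, declared as
!     real :: t1, t2, t3
! alongside the kernel's other local declarations.
t1 = UA_dev(J3,K)*(YN_dev(M2)-YN_dev(M1))   ! reads UA_dev and YN_dev
t2 = UA_dev(J1,K)*(YN_dev(M3)-YN_dev(M2))
t3 = UA_dev(J2,K)*(YN_dev(M1)-YN_dev(M3))
UX_dev(J,K) = (t1 + t2 + t3) / ART_dev(I)   ! reads ART_dev, writes UX_dev
```

Setting one temporary to 0.0 at a time (instead of computing it) keeps the kernel compiling while taking that term's array reads out of the picture.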

- Mat