It’s your device-to-device copies that are causing the problem. We are working on adding support for device-to-device copies, but it’s not available yet.
Actually, the CUDA Fortran reference guide was written before the implementation was finished. A few things turned out to be very difficult to implement (such as random_number) and have not been added to a release yet.
To clarify: device-to-device transfers between arrays in global device memory are allowed. In this case, however, the user is trying to copy device to device between arrays located in constant memory. I should have been clearer in my response.
I’m not even sure if CUDA C allows constant-to-constant data transfers. If it does, then this is more of a bug on our part, since we obviously missed it. If it doesn’t, then we’ll most likely not be able to support it either.
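As a rough illustration of the distinction, here is a minimal sketch, assuming a compiler with CUDA Fortran enabled. The module, program, and array names (const_data, c1, c2, a_dev, b_dev) are made up for this example and are not from the user's code. It copies between two arrays in global device memory by assignment, and fills two constant-memory arrays from the same host buffer rather than copying one constant array to the other.

```fortran
module const_data
  use cudafor
  implicit none
  ! Hypothetical constant-memory arrays; names are illustrative only
  real, constant :: c1(4), c2(4)
end module const_data

program copy_sketch
  use cudafor
  use const_data
  implicit none
  real :: h(4)
  real, device :: a_dev(4), b_dev(4)

  h = (/ 1.0, 2.0, 3.0, 4.0 /)

  a_dev = h        ! host to global device memory
  b_dev = a_dev    ! device to device, both arrays in global memory
  h = b_dev        ! copy back to the host to check the round trip

  c1 = h           ! host to constant memory
  c2 = h           ! second copy from the host buffer, instead of c2 = c1

  print *, h
end program copy_sketch
```

Staging through the host costs an extra transfer, but constant data is usually small, so this should be a cheap workaround as long as the host still holds the values.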
Thanks for questioning me when my answers are not as clear as they should be.
Thanks a lot. It’s much better now after I changed the code. However, there is still a similar problem even after I removed all device-to-device copies. I have a very long loop in “A3DMain.f” like this:
Each of the B*_dev subroutines is written in a .cuf file and calls several global kernels.
When nstep is 1, the code is running OK through B1_dev to B12_dev. But when nstep is 2, at the first transfer code of B1_dev:
Ramp_dev = Ramp
, a similar error occurs:
“0: copyin Symbol Memcpy (dev=0x60eac0, host=0x616878, size=4, offset=12072) FAILED: 4”
I’ll check the entire code once again. I want to know what else would lead to this problem. Thanks!
Sometimes you see these kinds of unexplained errors when a kernel launched before the memory copy has failed. Are you checking the return status of your kernel launches? If not, try adding the following code after each one.
! Wait for the kernel to finish so that execution errors surface here
istat = cudaThreadSynchronize()
! Pick up any error left by the launch or by the kernel itself
errCode = cudaGetLastError()
if (errCode .ne. cudaSuccess) then
  print *, 'ERROR in B12_dev:', errCode, trim(cudaGetErrorString(errCode))
  stop 'Error! Kernel failed!'
endif
ELSEIF (UVM_dev(J)==0.5) THEN
UX_dev(J,K) = 0.0
ENDIF
and then it’s running fine.
But I can’t see why the original code fails. Under the “release” configuration, the new code runs when compiled by PVF 11.1 with CUDA toolkit 3.2, but it still fails when compiled by PVF 10.8 with CUDA toolkit 3.1. Under the “debug” configuration, the program does not run and exits immediately without a warning.
I’ll check the return status and look for other errors.
My best guess is that you’re getting an access violation when indexing the UA_dev, YN_dev, or ART_dev arrays. What I would do first is compile the code in emulation mode with array bounds checking enabled (-Mcuda=emu -Mbounds). If that doesn’t show anything, I would then break up your “UX_dev=” expression into separate statements and comment out each line in turn until you can determine which array is causing the fault.
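Breaking up the expression could look something like the sketch below. The original “UX_dev=” expression is not shown in this thread, so the array shapes, the formula, and the names bisect_demo, update_ux, t_ua, t_yn and t_art are all assumptions made for illustration. Each array is read into its own temporary on its own line, so a line can be commented out (and its temporary set to a constant such as 1.0) to see which read triggers the fault.

```fortran
module bisect_demo
  use cudafor
  implicit none
contains
  ! Hypothetical kernel: shapes and formula are assumed, not the user's code
  attributes(global) subroutine update_ux(UX_dev, UA_dev, YN_dev, ART_dev, n, m)
    integer, value :: n, m
    real :: UX_dev(n,m), UA_dev(n,m), YN_dev(n), ART_dev(m)
    real :: t_ua, t_yn, t_art
    integer :: j, k

    j = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    k = (blockIdx%y - 1) * blockDim%y + threadIdx%y

    ! Threads past the array extents must not touch memory
    if (j > n .or. k > m) return

    ! One array access per line, so each can be disabled in turn
    t_ua  = UA_dev(j,k)
    t_yn  = YN_dev(j)
    t_art = ART_dev(k)

    UX_dev(j,k) = t_ua * t_yn / t_art
  end subroutine update_ux
end module bisect_demo
```

The bounds guard near the top is worth checking in the real kernel as well: when the grid is larger than the arrays, a missing guard is a common source of this kind of access violation.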