I would expect the output “result: 45” (on a little-endian machine) or “result: 23” (on a big-endian machine). However, running this on the GPU gives me “result: 0”. Running it in device emulation mode gives me “result: 45”, as I would expect. Does anyone see what I’m doing wrong?
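For readers following along, here is a minimal host-only sketch of why the two expected outputs differ. It assumes the value in question is the 16-bit word 0x2345 and that the program prints its first byte in hex; the posted code is not reproduced here, so both the value and the helper name `firstByte` are hypothetical.

```cuda
#include <cstdio>
#include <cstring>

// Hypothetical stand-in for the value discussed above: a 16-bit word 0x2345.
// Returns the byte stored at the lowest address of that word.
unsigned firstByte(unsigned short value) {
    unsigned char bytes[sizeof value];
    // memcpy is a well-defined way to inspect an object's byte layout.
    std::memcpy(bytes, &value, sizeof value);
    return bytes[0];
}

int main() {
    // Prints "result: 45" on a little-endian host, "result: 23" on a big-endian one.
    std::printf("result: %x\n", firstByte(0x2345));
    return 0;
}
```

A little-endian machine stores the low byte (0x45) first, which is why the emulation-mode output on an x86 host would be “result: 45”.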
You are missing a call to cudaThreadSynchronize() after your kernel launch, so you are copying out the results without ensuring that the kernel has finished executing. That’s why.
I don’t think this is true. From the Guide:
“Any kernel launch, memory set, or memory copy for which a zero stream parameter has been specified begins only after all preceding operations are done, including operations that are part of other streams, and no subsequent operation may begin until it is done.”
So the cudaMemcpy() proceeds only after the kernel has finished executing.
Implicit thread synchronization has been around since the 0.8 beta (and earlier, I would guess).
Anything you do in CUDA that touches global memory (even an async operation) waits in a queue on the GPU and runs only after the previous operation finishes. If that operation involves copying to the host, there is an implicit cudaThreadSynchronize().
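The ordering described above can be sketched as follows. This is only an illustration under stated assumptions: the kernel `fillIndices`, the buffer size, and the launch configuration are made up for the example and are not taken from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its own index into global memory.
__global__ void fillIndices(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    const int n = 256;
    int host[n];
    int *dev = nullptr;

    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) return 1;

    // Launch in the default (zero) stream; control returns to the host right away.
    fillIndices<<<(n + 127) / 128, 128>>>(dev, n);

    // No explicit synchronization call here: this blocking copy is queued behind
    // the kernel in the same stream, so it starts only after the kernel is done.
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    std::printf("host[%d] = %d\n", n - 1, host[n - 1]);
    cudaFree(dev);
    return 0;
}
```

Checking the value returned by cudaMemcpy() is worth doing in any case, since an error from the preceding kernel launch can surface there, and a kernel that never ran would leave the output buffer unwritten.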
I think the problem is that you are using pointer arithmetic on local variables inside the kernel. Those variables are stored in registers, which have no addresses, so you can’t do pointer arithmetic on them.
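If taking the address of a local really is the culprit (a guess, not something confirmed in this thread), one way to sidestep it is to extract the byte with shifts and masks instead of a pointer. The sketch below assumes the same hypothetical 16-bit value 0x2345; `byteOf` and `lowByte` are names invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: extract byte `index` (0 = least significant) of a
// 16-bit value using shifts and masks, so no address of a local is taken.
__host__ __device__ inline unsigned byteOf(unsigned short value, int index) {
    return (value >> (8 * index)) & 0xFFu;
}

// Hypothetical kernel: stores the low byte of 0x2345 into global memory.
__global__ void lowByte(unsigned *out) {
    unsigned short value = 0x2345;
    *out = byteOf(value, 0);  // 0x45 regardless of endianness
}

int main() {
    unsigned host = 0;
    unsigned *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(unsigned)) != cudaSuccess) return 1;

    lowByte<<<1, 1>>>(dev);
    if (cudaMemcpy(&host, dev, sizeof(unsigned), cudaMemcpyDeviceToHost) != cudaSuccess) {
        cudaFree(dev);
        return 1;
    }

    std::printf("result: %x\n", host);
    cudaFree(dev);
    return 0;
}
```

Note that, unlike reading through a byte pointer, the shift-based version gives the same answer on little-endian and big-endian machines, so it is not a drop-in replacement if the point of the test is to detect endianness.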