Has anyone had any success copying memory to a GPU while a global function is executing?
I was talking with an NVIDIA hardware engineer who said that there were no hardware limitations for performing a DMA copy to the GPU while GPU threads are executing. However, when I launch two host threads, have the first thread launch a global function and the second one do a memory copy to the device, the results are never visible to the global function (it sits in a loop polling the intended destination of the memory copy forever).
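For reference, the failing setup looks roughly like this (a minimal sketch of what I described, not working code; names like `poll_kernel` and `d_flag` are just illustrative, and it hangs at the synchronize call as described):

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

__global__ void poll_kernel(volatile int *flag, int *result)
{
    // Spin until the host-side copy becomes visible
    // (in my tests, it never does).
    while (*flag == 0)
        ;
    *result = 1;
}

static int *d_flag;

void *copy_thread(void *arg)
{
    int one = 1;
    // Second host thread: copy into the word the kernel is polling.
    cudaMemcpy(d_flag, &one, sizeof(int), cudaMemcpyHostToDevice);
    return NULL;
}

int main(void)
{
    int *d_result;
    int zero = 0;
    cudaMalloc((void **)&d_flag, sizeof(int));
    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemcpy(d_flag, &zero, sizeof(int), cudaMemcpyHostToDevice);

    poll_kernel<<<1, 1>>>(d_flag, d_result);  // first host thread: launch kernel

    pthread_t t;
    pthread_create(&t, NULL, copy_thread, NULL);  // second thread: do the copy
    pthread_join(t, NULL);

    cudaThreadSynchronize();  // never returns: the kernel polls forever
    return 0;
}
```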
If this is not a hardware limitation, will it be fixed in a future revision of CUDA? This technique could significantly reduce the latency of operations on large data sets, since computation could begin as soon as a small amount of data has arrived rather than waiting for potentially hundreds of MB to be transferred.