Overlapping kernel execution and memory copy

Has anyone had any success copying memory to a GPU while a global function is executing?

I was talking with an NVIDIA hardware engineer who said that there were no hardware limitations for performing a DMA copy to the GPU while GPU threads are executing. However, when I launch two host threads, have the first thread launch a global function and the second one do a memory copy to the device, the results are never visible to the global function (it sits in a loop polling the intended destination of the memory copy forever).

If this is not a hardware limitation, will it be fixed in a future revision of CUDA? Using this technique should significantly reduce the latency of operations on large data sets, since the computation can begin as soon as a small amount of data is received rather than waiting for potentially hundreds of MB to be transferred.
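To put rough numbers on the latency argument above, here is a back-of-the-envelope model in C++. It assumes the data is split into n equal chunks, each taking copy_ms to transfer and compute_ms to process, and that copy and compute overlap perfectly. The function names and the model itself are only an illustration under those assumptions, not anything CUDA provides.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model: n equal chunks, each needing copy_ms to transfer
// and compute_ms to process. All names and parameters are illustrative.

// No overlap: the whole transfer finishes before any computation starts.
double serial_latency_ms(int n, double copy_ms, double compute_ms) {
    assert(n > 0);
    return n * copy_ms + n * compute_ms;
}

// Ideal overlap: chunk i is processed while chunk i+1 is being copied,
// so after the first copy the slower of the two stages sets the pace.
double overlapped_latency_ms(int n, double copy_ms, double compute_ms) {
    assert(n > 0);
    return copy_ms + (n - 1) * std::max(copy_ms, compute_ms) + compute_ms;
}
```

For example, with 10 chunks at 2 ms per copy and 1 ms per compute, this model gives 30 ms without overlap and 21 ms with it. Real hardware would add per-transfer and per-launch overheads that the model ignores.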

This has been discussed many times before. NVIDIA reps have always responded stating that it cannot be done due to hardware limitations, hinting that future generations of the hardware may not have such limitations.

I can see how a pipelined set of computations could benefit from this feature, but how could a single one? When you make a kernel call, the device starts dozens of blocks running concurrently, in an undefined order. There is no guarantee that a block reading the end of the memory being transferred is not executed first.

I agree that there are synchronization problems because you cannot guarantee the order in which operations would complete. However, there are several ways that this could be overcome. The simplest would be to encode the data being transferred in a format that threads executing on the device would be able to immediately recognize if it was valid or not. If valid data was not found, the threads would continue to poll until it was.
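A minimal sketch of that polling idea, written as plain host-side C++ with two threads standing in for the host copy and the device kernel. This is an analogy rather than CUDA code, and it says nothing about whether a running kernel would actually see the copied data. The sentinel value kEmpty and all function names are assumptions made for the example; the scheme only works if real data can never equal the sentinel.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed marker for "not yet written"; real payloads must never use it.
constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;

// Stand-in for the device threads: poll each slot until it holds valid data.
std::uint64_t consume_sum(std::vector<std::atomic<std::uint32_t>>& slots) {
    std::uint64_t sum = 0;
    for (auto& slot : slots) {
        std::uint32_t v;
        while ((v = slot.load(std::memory_order_acquire)) == kEmpty) {
            std::this_thread::yield();  // data not there yet, keep polling
        }
        sum += v;
    }
    return sum;
}

// Stand-in for the host copy: fill the slots in order with 1, 2, 3, ...
void produce(std::vector<std::atomic<std::uint32_t>>& slots) {
    for (std::size_t i = 0; i < slots.size(); ++i) {
        std::uint32_t value = static_cast<std::uint32_t>(i + 1);
        assert(value != kEmpty);  // payload must be distinguishable from "empty"
        slots[i].store(value, std::memory_order_release);
    }
}

// Runs both sides concurrently and returns the sum the consumer observed.
std::uint64_t run_sentinel_demo(std::size_t n) {
    std::vector<std::atomic<std::uint32_t>> slots(n);
    for (auto& s : slots) s.store(kEmpty, std::memory_order_relaxed);
    std::uint64_t sum = 0;
    std::thread consumer([&] { sum = consume_sum(slots); });
    std::thread producer([&] { produce(slots); });
    producer.join();
    consumer.join();
    return sum;
}
```

Calling run_sentinel_demo(1000) returns 500500 regardless of which thread gets going first, because the consumer never reads a slot that still holds the sentinel.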

A better way would be to provide some notification mechanism from the host to the device that indicates an operation has completed, but does not otherwise affect the operation of threads on the device. It is possible to implement asynchronous circular buffers in software: the host pushes data into the buffer and updates a head-of-buffer pointer until the buffer is full, and the device pulls data out of the buffer and updates a tail-of-buffer pointer until the buffer is empty. The host blocks if it detects that the buffer is full, and the device blocks if it detects that the buffer is empty. Even better would be hardware implementations of such buffers.
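The circular buffer described above could look roughly like the following single-producer, single-consumer sketch. It is ordinary C++ with a second CPU thread playing the role of the device, so it only illustrates the head/tail protocol and not a working host-to-GPU path. The class and function names are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Single-producer, single-consumer ring buffer. One slot is kept unused
// so that "full" and "empty" can be told apart from head and tail alone.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : data_(capacity + 1) {
        assert(capacity > 0);
    }

    // Producer ("host") side: returns false if the buffer is full.
    bool try_push(int value) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % data_.size();
        if (next == tail_.load(std::memory_order_acquire)) return false;
        data_[head] = value;                           // write the data first
        head_.store(next, std::memory_order_release);  // then publish it
        return true;
    }

    // Consumer ("device") side: returns false if the buffer is empty.
    bool try_pop(int& value) {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        value = data_[tail];                           // read the data first
        tail_.store((tail + 1) % data_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<int> data_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Pushes 1..count through the ring and returns the sum seen by the consumer.
long long run_ring_demo(int count, std::size_t capacity) {
    RingBuffer ring(capacity);
    long long sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) {
            int v = 0;
            while (!ring.try_pop(v)) std::this_thread::yield();  // empty: wait
            sum += v;
        }
    });
    for (int i = 1; i <= count; ++i) {
        while (!ring.try_push(i)) std::this_thread::yield();     // full: wait
    }
    consumer.join();
    return sum;
}
```

The important detail is the ordering: each side touches the data slot before it updates its own pointer, so the other side never sees a pointer that runs ahead of the data.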

Right, as long as you’re willing to poll, then you could have a buffer with an associated ‘id number’ and write a steadily increasing id number after you’ve written the data. Of course, you might need some synchronization back to the host, too (to be sure that the GPU is done with the previous buffer).
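Here is a rough sketch of the id-number scheme, including the synchronization back to the producer. As with any host-only illustration, two CPU threads stand in for host and GPU, and the names (SharedBuffer, ready_id, done_id, run_id_demo) are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct SharedBuffer {
    std::vector<int> data;
    std::atomic<std::uint32_t> ready_id{0};  // producer writes this AFTER the data
    std::atomic<std::uint32_t> done_id{0};   // consumer writes this when finished
    explicit SharedBuffer(std::size_t n) : data(n) { assert(n > 0); }
};

// Sends `batches` batches of n values through one reusable buffer and
// returns the sum of everything the consumer read.
long long run_id_demo(std::uint32_t batches, std::size_t n) {
    SharedBuffer buf(n);
    long long total = 0;

    std::thread consumer([&] {
        for (std::uint32_t id = 1; id <= batches; ++id) {
            // Poll until the id for this batch has been published.
            while (buf.ready_id.load(std::memory_order_acquire) != id) {
                std::this_thread::yield();
            }
            for (int v : buf.data) total += v;
            // Tell the producer the buffer may be overwritten.
            buf.done_id.store(id, std::memory_order_release);
        }
    });

    for (std::uint32_t id = 1; id <= batches; ++id) {
        // Wait until the consumer is done with the previous batch.
        while (buf.done_id.load(std::memory_order_acquire) != id - 1) {
            std::this_thread::yield();
        }
        for (auto& v : buf.data) v = static_cast<int>(id);  // fill batch `id`
        buf.ready_id.store(id, std::memory_order_release);  // publish it
    }

    consumer.join();
    return total;
}
```

With 100 batches of 16 values, where batch k is filled with the value k, run_id_demo(100, 16) returns 80800. A real implementation would probably double-buffer so the producer can fill one buffer while the consumer works on the other.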

Supposing that there’s actually more GPU computation than transfer, the inefficiency of making a few small ‘synchronization transfers’ could be entirely hidden within computation.

The issue of whether this is possible in hardware is interesting. We used to be an OpenGL / DirectX GPGPU shop, and we did simultaneous transfer and computation under OpenGL on a GeForce 7 (at least, it was supposedly simultaneous; we never did realize great real-world performance). But it was supposedly possible to transfer OpenGL textures and run OpenGL shader programs simultaneously (maybe the spec allowed this but the implementation didn’t really do it?).

It could be possible that the hardware to do this is switched off in CUDA mode - there are a few other things that we’ve heard can’t be done in CUDA mode that could be done in OpenGL mode (all those tricks with depth test, for example).

Curious as to where the limitations are. It sounds like they will be overcome in some generation relatively soon, though.


Coming very soon

Hmmn, perhaps:

“… CUDA v1.1’s improved support for CPU/GPU overlap. The CPU will be able to continue executing code (including the driver) while the GPU is memcpy’ing or processing data, so driver overhead at least will be hidden as long as the GPU is busy…”

It’s not entirely clear from this statement that 1.1 will allow the overlap of memcpy and processing data. A plausible reading of this is that the CPU instructions in the driver can overlap with either memcpy OR processing data only. One might be able to hide most of the latency of a small transfer followed by a kernel launch, but I don’t think that the statement you linked to is indicative of memcpy/kernel overlap in CUDA 1.1. I could be wrong on this, of course.

You are most likely correct. I keep on falling in the hole of being overly optimistic… When technical stuff is written I am sure they think what is the MOST we can say about what has been done, and so when reading Nvidiaese one needs to think of what is the LEAST they could have done and still make the claim.

In this case, enabling the queuing of transfers in the driver along with kernel calls would be sufficient to satisfy the statement. Even a new synchronisation mechanism might just mean that we can poll the destination buffer of the last transfer in a chain for completion, when what we really need is an interrupt-driven synchronisation mechanism.

This enhancement is probably quite worthwhile, as latencies can be significantly reduced. In the current chained-kernel environment, the first kernel seems to have a minimum latency of 70 us (often double that), while a chained kernel seems to have a launch time closer to 12-14 us (for the same kernel; I have not measured different ones).