64KB gotcha?

From the Programming Guide: memory copies from host to device of a memory block of 64 KB or less are performed asynchronously with respect to the host.

Doesn’t this type of behavior have a lot of potential for surprises? And what’s the advantage of making such an exception for small host-to-device copies as opposed to larger copies or copies from device to host?

It has been a long time since I have dealt with driver details; txbob can probably provide a more authoritative answer. I think the description in the Programming Guide is a tad misleading.

As far as I recall, what happens is that these small host->device copies are performed by stuffing the data to be transferred directly into the GPU's input queue. The host data is therefore grabbed synchronously with respect to the host thread, just as with regular host->device copies. As seen by the host thread, the data is then delivered asynchronously to the GPU, just as kernel launch commands are delivered asynchronously. Since the data travels in order with kernel launches through the same queue, there are no data-dependency issues with a subsequent kernel that consumes the copied data: as seen by kernels executing on the GPU, the transferred data arrives synchronously.
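To make that concrete, here is a minimal sketch (my own illustration, not anything taken from the driver) of the pattern above: a small host->device cudaMemcpy followed immediately by a kernel that consumes the data. No extra synchronization is needed, because the copy's payload and the kernel launch are delivered in order through the same queue. Kernel name and sizes are made-up example values.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum(const int *in, int n, int *out)
{
    int s = 0;
    for (int i = 0; i < n; ++i) s += in[i];   // single thread, just for the demo
    *out = s;
}

int main()
{
    const int n = 1024;                        // 4 KB payload, well under 64 KB
    int hIn[n], hOut = 0;
    for (int i = 0; i < n; ++i) hIn[i] = 1;

    int *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, sizeof(int));

    // Small host->device copy of a pageable buffer: may return to the host
    // before the data physically arrives on the GPU ...
    cudaMemcpy(dIn, hIn, n * sizeof(int), cudaMemcpyHostToDevice);
    // ... but the kernel still sees the data, because the copy's payload and
    // the launch command travel in order through the same input queue.
    sum<<<1, 1>>>(dIn, n, dOut);
    cudaMemcpy(&hOut, dOut, sizeof(int), cudaMemcpyDeviceToHost);  // blocks until done

    printf("sum = %d (expected %d)\n", hOut, n);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```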

So everything behaves as one would expect based on the CUDA programming model (no races), and what the guide describes is purely an implementation detail. As I recall, this mechanism (data transport through the GPU input queue) gives a significant performance boost for small transfers by reducing latency and driver overhead.
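If you want to observe the latency effect from the host side, something along these lines should show it (again just a sketch; the exact numbers depend on GPU, driver, platform, and whether the host buffer is pageable): the small copy typically returns to the host much faster than the large one, which blocks for roughly the duration of the PCIe transfer.

```
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

// Time a single host->device cudaMemcpy from the host thread's point of view.
static double h2d_us(void *dst, const void *src, size_t bytes)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main()
{
    const size_t smallBytes = 32 * 1024;         // below the 64 KB threshold
    const size_t largeBytes = 64 * 1024 * 1024;  // well above it

    void *dSmall, *dLarge;
    cudaMalloc(&dSmall, smallBytes);
    cudaMalloc(&dLarge, largeBytes);

    void *hSmall = malloc(smallBytes);           // pageable host memory
    void *hLarge = malloc(largeBytes);
    memset(hSmall, 1, smallBytes);
    memset(hLarge, 1, largeBytes);

    cudaFree(0);   // force context creation so it is not part of the timing

    printf("small copy: %8.1f us host-side\n", h2d_us(dSmall, hSmall, smallBytes));
    printf("large copy: %8.1f us host-side\n", h2d_us(dLarge, hLarge, largeBytes));

    cudaDeviceSynchronize();   // the small copy's data still arrives in order
    free(hSmall); free(hLarge);
    cudaFree(dSmall); cudaFree(dLarge);
    return 0;
}
```

To be clear, this staging of small copies is a driver optimization; when you actually need copies to overlap with host work or with kernels, pinned memory plus cudaMemcpyAsync on a stream is still the tool to reach for.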

Thanks very much for the explanation!