memory copy overlap

Hi everyone,

Does CUDA allow us to overlap a host-to-device memory copy with host code? To clarify: can we transfer a large amount of data to the GPU while executing code on either the CPU or the GPU?


You can overlap memory transfers with operations on the CPU by using the Async versions of the memcpy calls. You can overlap a memory copy with a kernel execution by using the streams API (see the programming guide), but only compute 1.1 hardware (G92 and newer) can do the overlap. Compute 1.0 hardware will serialize the operations.
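A minimal sketch of both kinds of overlap described above (the kernel name, stream names, and sizes are placeholders, not from this thread). Note that the async copy runs from pinned host memory, and that the copy and the kernel must sit in different streams for the hardware to overlap them:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *d_buf, size_t n) { /* placeholder kernel */ }

int main(void)
{
    const size_t N = 1 << 20;
    float *h_data, *d_data;

    // cudaMemcpyAsync requires page-locked (pinned) host memory.
    cudaMallocHost((void **)&h_data, N * sizeof(float));
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Returns immediately: the transfer proceeds while the CPU keeps working.
    cudaMemcpyAsync(d_data, h_data, (N / 2) * sizeof(float),
                    cudaMemcpyHostToDevice, s1);

    // ... CPU work placed here overlaps with the copy on any CUDA device ...

    // A kernel launched in a *different* stream can overlap with the copy,
    // but only on compute 1.1+ parts; compute 1.0 serializes the two.
    myKernel<<<256, 256, 0, s2>>>(d_data + N / 2, N / 2);

    cudaStreamSynchronize(s1);   // wait for the copy
    cudaStreamSynchronize(s2);   // wait for the kernel

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```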

Is this correct?

“compute 1.1 hardware (G92 and newer)”

I have an 8600 GTS - it reports major=1, minor=1, and it is a G84. I tried both Async and Streams on it and they both work.

I think the G84 came out later than the G92.

Nope, the G84 has been out for a year or so. I believe it's just a typo in the doc.

Where in the documentation does it say that compute hardware 1.1 is necessary to overlap asynchronous host to device memory transfers and kernel execution?


This post from NVIDIA mentions that only compute 1.1 devices have this capability:…ndpost&p=292323

CPU/GPU concurrency via cuMemcpy*Async is an artifact of the GPU and CPU being separate devices, and so is available on all CUDA-capable hardware.

The “async memcpy” capability that is available only on compute 1.1 devices is the ability to overlap host<->device memcpy with kernel execution. This is a separate level of concurrency, but it requires very similar synchronization primitives, so the same APIs are used to access both pieces of functionality. (In any case, the expectation was that anyone who wanted memcpy/kernel concurrency would also want the API calls to be asynchronous, i.e. CPU/GPU concurrency.)
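Rather than relying on chip names like G84 vs. G92, you can check this capability at run time via the `deviceOverlap` field of `cudaDeviceProp`. A small sketch (device index 0 is assumed):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("%s: compute %d.%d\n", prop.name, prop.major, prop.minor);
    // deviceOverlap is nonzero when the GPU can execute a kernel while a
    // host<->device memcpy is in flight (compute 1.1 and newer parts).
    printf("Can overlap memcpy with kernel execution: %s\n",
           prop.deviceOverlap ? "yes" : "no");
    return 0;
}
```

On an 8600 GTS like the one mentioned above, this should report compute 1.1 and overlap support, which matches the poster's observation.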