Is there any kind of Host <-> Device concurrency?

Is there any way to overlap any kind of host <-> device access at all? We’re trying to build a “real time” system and the inability to do any kind of overlap is killing our timing.

Roughly what I’ve got is:

12 ms host -> device copy
75 ms misc. processing
21 ms device -> host copy

The data is already being “compressed” for both copies.

Is there any way to overlap any of that, so that if I ran the sequence twice (or continuously) the average iteration time would be less than 108 ms? I think I've seen that we can't, but even just running the two copies simultaneously would cut the average by 12 ms. Is this possible on the hardware? Is it possible in current CUDA?


No, there is currently no way to overlap transfers and computation in CUDA.

Have you tried using page-locked system memory for the transfers (cudaMallocHost)? This can improve transfer times.
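For reference, a minimal sketch of the pinned-memory pattern being suggested, using the CUDA runtime API; the buffer size is a placeholder and the kernel launch is elided:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 1 << 20;   /* placeholder buffer size */
    float *h_buf, *d_buf;

    /* Page-locked (pinned) host allocation: the GPU's DMA engine
       can access it directly, so cudaMemcpy runs faster than it
       does from ordinary pageable memory. */
    cudaMallocHost((void **)&h_buf, bytes);
    cudaMalloc((void **)&d_buf, bytes);

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    /* ... kernel launches would go here ... */
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Note this only speeds up the copies themselves; it does not let them overlap with computation.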

Yes, of course; all our transfers use page-locked memory. The issue is that we need to move a lot of data on and off the board, and there's no way around that. If we could be sending or receiving data while the next set processes, that would be excellent.

Can you say whether this is a CUDA limitation or a hardware limitation, and when it might be remedied?


It’s a hardware limitation.

To be more precise, it is a limitation in this generation of GPUs. It will be possible to overlap in future ones.

But how is this a hardware limitation? As I remember, it was possible to render and stream to GPU memory at the same time. Why not with CUDA?

No, it has only been possible to use the CPU concurrently with the memory transfers. The GPU has always waited for memory transfers to complete, even when using pixel buffer objects.