cuMemcpy call has unacceptable overhead

I am trying to apply CUDA in a real-time hardware-in-the-loop simulation. My simulation time step is 50 usec. I have timed a single call of cuMemcpyDtoH transferring the smallest amount of data at 18 usec. Assuming that cuMemcpyHtoD takes the same amount of time, I will be spending 36 usec on memory transfers alone, leaving only 14 usec for kernel execution. However, my kernel requires approximately 25 usec. I must find a way to reduce the memory transfer overhead. Any help would be appreciated.
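For reference, a measurement like the one described might look roughly like this (a sketch using the driver API and std::chrono; the transfer size and loop count here are arbitrary, not the original poster's code):

```cuda
#include <cuda.h>
#include <chrono>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr dptr;
    cuMemAlloc(&dptr, sizeof(int));     // smallest practical transfer: 4 bytes
    int host = 0;

    const int iters = 10000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        cuMemcpyDtoH(&host, dptr, sizeof(int)); // synchronous: each call pays full overhead
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("average cuMemcpyDtoH latency: %.1f usec\n", us);

    cuMemFree(dptr);
    cuCtxDestroy(ctx);
    return 0;
}
```

Averaging over many calls like this smooths out scheduler jitter, which matters when the per-call cost is only tens of microseconds.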


If you’re transferring a 2D array, then allocating it with cudaMallocPitch might make the copy slightly faster… that’s the only idea I have, really.
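For what it’s worth, a pitched allocation and copy would look something like this (a sketch with made-up dimensions; whether it helps at all for tiny transfers is another question):

```cuda
#include <cuda_runtime.h>

void copy_pitched_example() {
    // Hypothetical 2D array: 64 floats wide, 32 rows.
    float *d_arr;
    size_t pitch;   // bytes per device row, padded for alignment
    cudaMallocPitch((void**)&d_arr, &pitch, 64 * sizeof(float), 32);

    static float h_arr[32][64];
    // cudaMemcpy2D understands the pitch, so every device row
    // starts on an aligned boundary.
    cudaMemcpy2D(h_arr, 64 * sizeof(float),  // dst, dst pitch (tightly packed)
                 d_arr, pitch,               // src, src pitch
                 64 * sizeof(float), 32,     // width in bytes, height in rows
                 cudaMemcpyDeviceToHost);

    cudaFree(d_arr);
}
```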

I don’t know if this would work for your application, but transferring the data for the next calculation while the current calculation is still running might help…
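If the iterations are independent, that overlap can be expressed with streams and async copies. A sketch, assuming pinned host buffers and a placeholder kernel and step size (`step` and `STEP_BYTES` are made up here):

```cuda
#include <cuda_runtime.h>

__global__ void step(float *data) { /* placeholder per-iteration kernel */ }
#define STEP_BYTES 256

// h_in / h_out must be allocated with cudaMallocHost for the
// async copies to actually overlap with kernel execution.
void run_pipelined(float *h_in, float *h_out, float *d_buf[2], int n_steps) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    const size_t n = STEP_BYTES / sizeof(float);
    for (int i = 0; i < n_steps; ++i) {
        cudaStream_t cur = s[i & 1];    // alternate streams
        float *d = d_buf[i & 1];        // ...and device buffers
        // Upload step i's input while the other stream may still be computing.
        cudaMemcpyAsync(d, h_in + i * n, STEP_BYTES,
                        cudaMemcpyHostToDevice, cur);
        step<<<1, 64, 0, cur>>>(d);
        cudaMemcpyAsync(h_out + i * n, d, STEP_BYTES,
                        cudaMemcpyDeviceToHost, cur);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Within one stream the copy-kernel-copy sequence stays ordered, so correctness per iteration is preserved; the overlap only happens between the two streams.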

Most of that 25 usec is probably the kernel launch overhead, too…

The values you quote are typical for the cuMemcpy overhead, even for a 4-byte transfer. There really isn’t any way around this except trying different versions of CUDA on different OSes. IIRC, when I last benchmarked the overhead, CUDA 2.0 on 64-bit Linux had the lowest by a small margin.

As your kernel is doing such a tiny amount of work, can’t you just implement this loop on the CPU? It is likely to be faster.

I don’t know if I understood this correctly, but IMO you have at least around 20 µs of overhead per kernel invocation on the host side (this was mentioned in other threads on this forum). So even if the host were doing nothing else, you wouldn’t get under ~60 µs for 2x memcpy plus one calculation.

I don’t know how “real” your realtime device has to be, but if some constant latency wouldn’t hurt, you could collect, copy, and work on bigger chunks of your data.

Use two GPUs controlled by 2 threads on a Dual Core host CPU. Offset the timing of the threads by 50 us each. Have one thread perform the even iterations, the other thread perform the odd iterations.

This way each GPU has 100 us to complete everything (2xcopy, 1xkernel call), giving you a bit of extra margin even.

But it only works if there are no direct data dependencies between the iterations. Hmm…
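A rough skeleton of that scheme, assuming pthreads and the runtime API (the per-iteration work is left as a placeholder; `Worker` and `gpu_thread` are names invented for this sketch):

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

struct Worker { int device; int parity; int n_iters; };

void *gpu_thread(void *arg) {
    Worker *w = (Worker*)arg;
    cudaSetDevice(w->device);   // bind this host thread to one GPU
    for (int i = w->parity; i < w->n_iters; i += 2) {
        // copy in, launch kernel, copy out for iteration i
        // (each GPU handles every other 50 us step, so it
        //  effectively gets ~100 us per step)
    }
    return NULL;
}

int main() {
    Worker even = {0, 0, 1000};   // GPU 0 takes even iterations
    Worker odd  = {1, 1, 1000};   // GPU 1 takes odd iterations
    pthread_t t0, t1;
    pthread_create(&t0, NULL, gpu_thread, &even);
    pthread_create(&t1, NULL, gpu_thread, &odd);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```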


Have you tried using page-locked memory already? The GPU can then use DMA transfers.
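In case it helps anyone reading along, pinned allocation is just a different host-side malloc (a minimal sketch; the buffer size is arbitrary):

```cuda
#include <cuda_runtime.h>

void pinned_buffer_example() {
    float *h_buf;
    // Page-locked (pinned) host memory: the GPU can DMA directly
    // from/to it, and it is required for cudaMemcpyAsync to be
    // truly asynchronous.
    cudaMallocHost((void**)&h_buf, 1024 * sizeof(float));

    // ... use h_buf as the host side of cudaMemcpy / cudaMemcpyAsync ...

    cudaFreeHost(h_buf);
}
```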

I did try page locked memory and only saw minimal improvement.