CPU to GPU data transfer latency

Hi all,

I have a GTX 285. CPU-to-GPU bandwidth is not a big issue for me, but latency becomes a problem, especially for small transfers. I know that a GPU may not be the ideal solution if small amounts of data are exchanged frequently between the CPU and the GPU. I am using cudaMemcpy to transfer the data. As you can see in the attached figure, for non-pinned memory the latency is between 10 and 20 usec for data sizes from 10 bytes to 10 KB. I assume this is fixed overhead such as DMA setup time. Is there any way to reduce this latency? I searched all over the internet and did not find much about it.

For some reason the x and y axis labels cannot be seen in the figure. The x axis is the data size in bytes and the y axis is the CPU-to-GPU transfer time in usec.
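For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of how the per-transfer latency could be timed for pageable and pinned host buffers. The helper name avgCopyMicros, the sizes and the iteration count are my own choices for illustration, not the code behind the figure, and cudaDeviceSynchronize assumes a toolkit newer than the one current for the GTX 285.

```cuda
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Average wall-clock time (usec) of one host-to-device cudaMemcpy of `bytes` bytes.
// Hypothetical helper for illustration only.
static double avgCopyMicros(void* dst, const void* src, size_t bytes, int iters) {
    assert(iters > 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);  // warm-up, not timed
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    }
    cudaDeviceSynchronize();  // make sure the last copy has really finished
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main() {
    const size_t maxBytes = 10 * 1024;
    const int iters = 1000;  // assumed repetition count
    void* dev = nullptr;
    void* pinned = nullptr;
    void* pageable = std::calloc(maxBytes, 1);
    if (!pageable || cudaMalloc(&dev, maxBytes) != cudaSuccess ||
        cudaMallocHost(&pinned, maxBytes) != cudaSuccess) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    std::memset(pinned, 0, maxBytes);

    const size_t sizes[] = {16, 128, 1024, 10240};
    for (size_t bytes : sizes) {
        std::printf("%6zu bytes: pageable %.2f usec, pinned %.2f usec\n", bytes,
                    avgCopyMicros(dev, pageable, bytes, iters),
                    avgCopyMicros(dev, pinned, bytes, iters));
    }

    cudaFreeHost(pinned);
    cudaFree(dev);
    std::free(pageable);
    return 0;
}
```

The numbers depend heavily on the driver, OS and chipset, so treat them as relative rather than absolute.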

Thanks in advance…

Selcuk
CPUtoGPU.PNG


Have you tried zero-copy memory? That might help a lot.

Thanks for the response. I was discouraged from using pinned memory for small data transfers, as you can see from the figure, so I didn’t give that a try. I will try zero-copy memory and post the results here.

The big plus is that you only have a kernel launch, so you skip the data-transfer latency, and you can have some PCIe latency hiding if your kernel performs real work.

Or, if you do not have a lot of work to do in your kernel, the time for data transfer plus kernel might equal the time previously spent on the data transfer alone.
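To make the zero-copy suggestion concrete, here is a minimal sketch, assuming device 0 supports mapped host memory. The kernel scale and the wrapper zeroCopyScale are hypothetical names for illustration. The kernel reads and writes host memory directly through device pointers, so there is no cudaMemcpy at all.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: every load and store below travels over PCIe.
__global__ void scale(const float* in, float* out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Runs scale() directly on mapped (zero-copy) host memory.
// Returns true if the results seen by the host are correct.
bool zeroCopyScale(int n) {
    assert(n > 0);
    // Older toolkits require this before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory) {
        return false;
    }

    float *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    const size_t bytes = n * sizeof(float);
    bool ok = cudaHostAlloc((void**)&hIn, bytes, cudaHostAllocMapped) == cudaSuccess &&
              cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocMapped) == cudaSuccess &&
              cudaHostGetDevicePointer((void**)&dIn, hIn, 0) == cudaSuccess &&
              cudaHostGetDevicePointer((void**)&dOut, hOut, 0) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) {
            hIn[i] = float(i);
            hOut[i] = -1.0f;  // sentinel so a kernel that never ran is detected
        }
        scale<<<(n + 255) / 256, 256>>>(dIn, dOut, n, 2.0f);
        ok = cudaDeviceSynchronize() == cudaSuccess;  // results are visible after this
        for (int i = 0; ok && i < n; ++i) {
            ok = (hOut[i] == 2.0f * hIn[i]);
        }
    }
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return ok;
}

int main() {
    std::printf("zero-copy scale %s\n", zeroCopyScale(1024) ? "ok" : "FAILED");
    return 0;
}
```

This works best when each element is touched only once, since nothing is cached in device memory between accesses.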

Hi Denis,

I have tried this and it works. Now the bottleneck is the kernel launch time, which is on the order of 12 usec. I guess there is no way to reduce this, is there?

Hmm, maybe you can find some info by searching the forums (searching with Google has worked best in the past). People have measured kernel launch time and what it depends on. I believe launching an empty kernel takes around 5 usec, and the more parameters your kernel call has, the longer it takes. Putting the addresses of arrays (if they do not change) in constant memory might help, for example (that is how it is done on Fermi, as far as I understand), but it might make the kernel itself take more time.

I am afraid some experimentation will be needed.
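As a starting point for that experimentation, here is a rough sketch of timing an empty kernel from the host. The names emptyKernel and avgLaunchMicros are made up for this example. With syncEach set, it measures the full launch-and-wait round trip the host sees. Without it, launches are queued back to back and the host waits once at the end, which gives a per-launch cost closer to throughput than to latency.

```cuda
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Average usec per launch of an empty kernel (hypothetical helper).
// syncEach = true : launch, then wait for completion every iteration.
// syncEach = false: queue all launches, wait once at the end.
double avgLaunchMicros(int iters, bool syncEach) {
    assert(iters > 0);
    emptyKernel<<<1, 1>>>();  // warm-up, absorbs one-time initialization
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        emptyKernel<<<1, 1>>>();
        if (syncEach) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main() {
    const int iters = 10000;  // assumed repetition count
    std::printf("launch + sync : %.2f usec\n", avgLaunchMicros(iters, true));
    std::printf("back to back  : %.2f usec\n", avgLaunchMicros(iters, false));
    return 0;
}
```

Adding parameters to emptyKernel and re-running is one way to check how much the argument count matters on a given setup.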

You can also start playing ugly, ugly hack games with persistent kernels in spin loops, looking at data coming in via zero-copy. That’s fraught with dangers and I won’t recommend it. However, you get unbeatably low latency.
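Purely to illustrate the idea, and with the same warning attached, here is a sketch of a persistent kernel that spins on a mailbox in mapped host memory. Everything in it (the Mailbox layout, persistentWorker, pingPong, the timeout) is my own assumption rather than a tested recipe. It relies on __threadfence_system, which needs compute capability 2.0 or newer, so it would not run as written on a GTX 285. It also assumes the driver starts the kernel right away, and a display watchdog may kill a kernel that spins for long.

```cuda
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

struct Mailbox {
    int seq;    // host bumps this to post a request
    int ack;    // device copies seq here once the reply is ready
    int quit;   // host sets this to 1 to stop the kernel
    float in;   // request payload
    float out;  // reply payload
};

// Single-thread kernel that keeps polling the mailbox until told to quit.
__global__ void persistentWorker(volatile Mailbox* mb) {
    int handled = 0;
    while (!mb->quit) {
        int s = mb->seq;
        if (s == handled) continue;  // nothing new yet
        __threadfence_system();      // read the payload only after seeing seq
        mb->out = 2.0f * mb->in;
        __threadfence_system();      // publish `out` before `ack`
        mb->ack = s;
        handled = s;
    }
}

// Returns the average round-trip time in usec, or -1.0 on failure or timeout.
double pingPong(int rounds) {
    using clock = std::chrono::steady_clock;
    assert(rounds > 0);
    cudaSetDeviceFlags(cudaDeviceMapHost);  // needed up front on older toolkits

    volatile Mailbox* h = nullptr;
    Mailbox* d = nullptr;
    if (cudaHostAlloc((void**)&h, sizeof(Mailbox), cudaHostAllocMapped) != cudaSuccess) {
        return -1.0;
    }
    h->seq = 0; h->ack = 0; h->quit = 0; h->in = 0.0f; h->out = 0.0f;
    if (cudaHostGetDevicePointer((void**)&d, (void*)h, 0) != cudaSuccess) {
        cudaFreeHost((void*)h);
        return -1.0;
    }

    persistentWorker<<<1, 1>>>(d);

    bool ok = true;
    auto t0 = clock::now();
    for (int i = 1; ok && i <= rounds + 1; ++i) {
        h->in = float(i);
        std::atomic_thread_fence(std::memory_order_release);  // payload before seq
        h->seq = i;
        auto deadline = clock::now() + std::chrono::seconds(5);
        while (h->ack != i) {
            if (clock::now() > deadline) { ok = false; break; }
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // ack before payload
        ok = ok && (h->out == 2.0f * float(i));
        if (i == 1) t0 = clock::now();  // round 1 is warm-up (kernel start-up)
    }
    auto t1 = clock::now();

    h->quit = 1;  // let the kernel fall out of its loop
    cudaError_t err = cudaDeviceSynchronize();
    cudaFreeHost((void*)h);
    if (!ok || err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / rounds;
}

int main() {
    std::printf("avg round trip: %.2f usec\n", pingPong(1000));
    return 0;
}
```

The host-side timeout is there so that a kernel that never starts does not hang the process. It adds a little overhead to each poll.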