I have a GTX 285. CPU-to-GPU bandwidth is not a big issue for me, but the latency becomes a problem, especially for small transfers. I know that the GPU may not be the ideal solution if small amounts of data are exchanged between the CPU and the GPU frequently. I am using cudaMemcpy to transfer the data. As you can see in the attached figure, for non-pinned memory the latency is between 10 and 20 usec for data sizes from 10 bytes to 10 Kbytes. I assume this is DMA setup time, etc. Is there any way to reduce this latency? I searched all over the internet and did not find much about it.
For some reason the x- and y-axis labels cannot be seen in the figure. The x axis is the data size in bytes and the y axis is the time the CPU-to-GPU transfer takes in usec.
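For reference, a minimal benchmark along the lines of what produced those numbers might look like this (buffer names and the 1 KB size are illustrative; a warm-up transfer keeps context-creation cost out of the measurement):

```cuda
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t size  = 1024;   // 1 KB, inside the 10 B - 10 KB range discussed
    const int    iters = 1000;

    char *h_buf = (char*)malloc(size);   // ordinary pageable host memory
    char *d_buf;
    cudaMalloc(&d_buf, size);

    // Warm up so context creation is not included in the timing.
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg H2D latency for %zu bytes: %.2f usec\n", size, us);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```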
Thanks for the response. I was discouraged by the pinned-memory results for small transfers (as you can see from the figure), so I didn’t give that a try. I will try zero-copy memory and will post the results here.
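In case it helps anyone following along, zero-copy means mapping pinned host memory into the device address space so a kernel reads it directly over PCIe, with no cudaMemcpy at all. A rough sketch of the setup (kernel and names are just for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel reads and writes host memory directly over PCIe; no cudaMemcpy involved.
__global__ void scale(float *data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

int main() {
    const int n = 256;

    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_data, *d_data;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Get the device-side alias of the same pinned allocation.
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    scale<<<1, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();                 // results are visible in h_data now

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFreeHost(h_data);
    return 0;
}
```

Note that mapped-memory support must be queried via the device properties on older hardware, and every access costs a PCIe round trip, so this only pays off for small, latency-sensitive exchanges.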
Hmm, maybe you can find some info by searching the forums (searching with Google worked best in the past). There have been people measuring kernel launch time and what it depends on. I believe launching an empty kernel takes around 5 usec, and the more parameters are in your kernel call, the longer it takes. Putting addresses of arrays (if they do not change) in constant memory might help, for example (that is how it is done in Fermi, as far as I understand), but it might make the kernel itself take more time.
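The constant-memory trick above would look roughly like this (a sketch, assuming the buffer pointer never changes after the one-time upload; names are illustrative):

```cuda
#include <cuda_runtime.h>

// Device pointer kept in constant memory instead of being passed as a
// kernel parameter on every launch.
__constant__ float *c_data;

__global__ void touch(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c_data[i] += 1.0f;
}

int main() {
    const int n = 256;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // One-time upload of the pointer; later launches carry one fewer argument.
    cudaMemcpyToSymbol(c_data, &d_data, sizeof(d_data));

    touch<<<1, 256>>>(n);   // repeated launches reuse the cached pointer
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```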
You can also start playing ugly, ugly hack games with persistent kernels in spin-loops, watching for data coming in via zero-copy. That’s fraught with danger and I won’t recommend it. However, you get unbeatably low latency.
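For the curious, the persistent-kernel pattern looks roughly like the sketch below. This is the hack in its simplest form (one thread, one mailbox; all names illustrative) and its correctness depends on hardware and driver behavior, which is exactly why it is not recommended:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Persistent kernel: spins on a flag in mapped host memory, processes one
// value per "message", and exits when the host sets the flag to -1.
__global__ void worker(volatile int *flag, volatile float *inbox,
                       volatile float *outbox) {
    while (true) {
        int f = *flag;
        if (f == -1) return;                 // shutdown signal
        if (f == 1) {                        // new input from the host
            *outbox = *inbox * 2.0f;
            __threadfence_system();          // make the result visible to the CPU
            *flag = 2;                       // signal "done"
        }
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *flag;  float *inbox, *outbox;
    cudaHostAlloc(&flag,   sizeof(int),   cudaHostAllocMapped);
    cudaHostAlloc(&inbox,  sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&outbox, sizeof(float), cudaHostAllocMapped);
    *flag = 0;

    int *d_flag;  float *d_in, *d_out;
    cudaHostGetDevicePointer(&d_flag, flag,   0);
    cudaHostGetDevicePointer(&d_in,   inbox,  0);
    cudaHostGetDevicePointer(&d_out,  outbox, 0);

    worker<<<1, 1>>>(d_flag, d_in, d_out);   // launched once, runs until told to stop

    *inbox = 21.0f;
    __sync_synchronize();                    // CPU-side fence before flipping the flag
    *flag = 1;
    while (*((volatile int*)flag) != 2) { }  // spin until the GPU answers
    printf("result: %f\n", *outbox);

    *flag = -1;                              // tell the kernel to exit
    cudaDeviceSynchronize();
    cudaFreeHost(flag); cudaFreeHost(inbox); cudaFreeHost(outbox);
    return 0;
}
```

The dangers are real: the spinning kernel occupies the GPU indefinitely (watchdog timeouts on display GPUs), and the polling burns a CPU core and PCIe bandwidth. But a round trip avoids both the kernel-launch and the cudaMemcpy setup costs entirely.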