Weird pageable <-> pinned memory performance

Hi everyone,

I did some memory testing with my GTX 285 and got some weird results: pageable memory is much faster than pinned memory if I transfer less than 64 KB. This is a real problem for me, because I have a streaming application that has to copy a lot of little data chunks to the device, and waiting until I have 64 KB of data is simply not possible.

Is this normal behaviour that I just have to live with? I attached an image with my results (note: the x-axis is logarithmic).

Thanks in advance…

You have a GTX 285, so just use zero-copy for copying small results.
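To illustrate the suggestion: zero-copy means allocating mapped, page-locked host memory and letting the kernel read and write it directly over PCIe, with no explicit cudaMemcpy at all. A minimal sketch, using current runtime API names (the chunk size and kernel are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel that touches mapped host memory directly.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // reads/writes go straight to host RAM
}

int main()
{
    const int n = 1536;                      // ~6 KB of floats, as in this thread
    float *h_ptr, *d_ptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must be set before the context is created
    cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    for (int i = 0; i < n; ++i) h_ptr[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaDeviceSynchronize();                 // kernel writes are visible after this

    printf("h_ptr[0] = %f\n", h_ptr[0]);
    cudaFreeHost(h_ptr);
    return 0;
}
```

The win for small chunks is that you skip the per-copy launch overhead of cudaMemcpy entirely; the cost is that every access from the kernel crosses the PCIe bus.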


Do the transfer figures include any mallocs? Pinned mallocs are pretty slow, AFAIK.

I’d also try an unpatched, vanilla Linux kernel if you’re really curious.

No, it just shows the raw transfer bandwidth, without any mallocs. Although the system is pretty new and clean, I will test it on a fresh one…

I have not used zero-copy yet, but I will test it in the next few days. I thought mapped memory has the same performance as pinned memory, but I would be happy to be mistaken.

For sizes like these, if I were you I would look at latency rather than bandwidth…
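The point about latency can be made concrete: below roughly 64 KB, each copy is dominated by a fixed per-call cost, not by how many bytes move. A sketch that times many small copies with CUDA events to expose that fixed cost (buffer size and iteration count are my own choices):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 6 * 1024;        // one 6 KB chunk, as in the thread
    const int    iters = 1000;

    char *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);   // pinned host buffer
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg per-copy time: %.1f us\n", 1000.0f * ms / iters);
    // If this number barely changes when you halve `bytes`,
    // the transfers are latency-bound, not bandwidth-bound.

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_buf); cudaFreeHost(h_buf);
    return 0;
}
```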

I have implemented mapped memory for my kernel, and the performance got even worse. This is not really a surprise, though, because a single memory copy is now split into several smaller copies (as many as there are thread blocks in the grid). The execution time doubled!

I do not really understand what you mean (maybe I’m just too dumb). I transfer about 6 kB of memory to the device over and over again, and the bandwidth directly affects the time needed. So which latency do you mean? RAM latency?

As you can see in the attachment, pinned memory with cudaMemcpyAsync() is always faster than pageable memory. This does not match the test in my first post (see the attachment there). The only difference between the two tests is that the first one was made with CUDA 2.1 and the second with CUDA 2.2. I cannot reproduce the first result with CUDA 2.2, so now I’m wondering whether there was a change in CUDA 2.2 that altered the performance.
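For reference, this is the pattern being compared in the attachment: cudaMemcpyAsync() only runs truly asynchronously when the host buffer is page-locked, so the pinned allocation and the async copy belong together. A minimal sketch (chunk size is my assumption):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 6 * 1024;
    char *h_buf, *d_buf;
    cudaStream_t stream;

    cudaMallocHost((void **)&h_buf, bytes);   // page-locked (pinned) allocation
    cudaMalloc((void **)&d_buf, bytes);
    cudaStreamCreate(&stream);

    // The copy is queued and the call returns immediately; the CPU can
    // prepare the next chunk while the DMA engine moves this one.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);            // wait before reusing h_buf
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

With a pageable buffer, the same cudaMemcpyAsync() call would silently fall back to a synchronous staged copy, which may explain part of the gap between the two measurements.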

Furthermore, two questions about streams came to my mind:
Does anyone know how much overhead is created by using streams?
If I have two streams with kernel calls, is it deterministic which kernel gets executed first?
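On the second question: within one stream, operations run in issue order, but across streams the programming model makes no ordering guarantee unless you impose one yourself, e.g. with an event. (On a GTX 285 kernels from different streams serialize anyway, since concurrent kernel execution needs a later compute capability; also note that cudaStreamWaitEvent() appeared in a CUDA release newer than 2.2.) A sketch with made-up kernel names:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() { }
__global__ void kernelB() { }

int main()
{
    cudaStream_t s1, s2;
    cudaEvent_t  aDone;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&aDone);

    kernelA<<<1, 64, 0, s1>>>();       // without the event below, kernelB could
    cudaEventRecord(aDone, s1);        // start before, after, or alongside this

    cudaStreamWaitEvent(s2, aDone, 0); // force: kernelB starts only after kernelA
    kernelB<<<1, 64, 0, s2>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaEventDestroy(aDone);
    return 0;
}
```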