Weird pageable <-> pinned memory performance

Hi everyone,

I did some memory testing with my GTX 285 and got some weird results: pageable memory is much faster than pinned memory if I transfer less than 64 KB. This is a real problem for me, because I have a streaming application which has to copy a lot of small data chunks to the device, and waiting until I have 64 KB of data is simply not possible.
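For reference, here is a simplified sketch of the kind of comparison I ran (not my exact benchmark; the chunk size and iteration count are just examples):

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* time host->device copies from a given host buffer, average per copy in ms */
static float time_copy(void *dst, const void *src, size_t bytes, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

int main(void)
{
    const size_t bytes = 6 * 1024;  /* example chunk size: 6 kB */
    const int iters = 1000;

    void *d_buf, *h_pageable, *h_pinned;
    cudaMalloc(&d_buf, bytes);
    h_pageable = malloc(bytes);          /* ordinary pageable memory */
    cudaMallocHost(&h_pinned, bytes);    /* page-locked (pinned) memory */

    printf("pageable: %f ms\n", time_copy(d_buf, h_pageable, bytes, iters));
    printf("pinned:   %f ms\n", time_copy(d_buf, h_pinned, bytes, iters));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```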

Is this normal behaviour that I just have to live with? I have attached an image with my results (note: the x-axis is logarithmic).

Thanks in advance…
[Attachment: bwtausw.png]

You have a GTX 285, so just use zero-copy for copying small results.
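Something along these lines, an untested sketch (the buffer size and kernel are placeholders):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* placeholder kernel: each thread doubles one element, reading and
   writing the mapped host buffer directly over PCIe */
__global__ void scale(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main(void)
{
    const int n = 1536;  /* ~6 kB of floats, just as an example */
    float *h_buf, *d_buf;

    /* must be called before the CUDA context is created */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_buf[i] = (float)i;

    /* device-side alias of the same page-locked host memory;
       no explicit cudaMemcpy needed */
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);

    scale<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaThreadSynchronize();  /* results are now visible in h_buf */

    printf("h_buf[1] = %f\n", h_buf[1]);  /* prints 2.0 */
    cudaFreeHost(h_buf);
    return 0;
}
```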

Hi,

Do the transfer figures include any mallocs? Pinned mallocs are pretty slow, AFAIK.
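If they did, the usual fix is to pay the allocation cost once, outside the streaming loop, and reuse the buffer for every chunk; a rough sketch (the loop body is a placeholder):

```c
#include <cuda_runtime.h>

#define CHUNK_BYTES (6 * 1024)

int main(void)
{
    float *h_chunk;
    void *d_buf;
    cudaMalloc(&d_buf, CHUNK_BYTES);
    cudaMallocHost((void **)&h_chunk, CHUNK_BYTES);  /* slow: do it once */

    /* reuse the same pinned buffer for every chunk in the stream */
    for (int i = 0; i < 1000; ++i) {
        /* ... fill h_chunk with the next chunk of data ... */
        cudaMemcpy(d_buf, h_chunk, CHUNK_BYTES, cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_chunk);
    cudaFree(d_buf);
    return 0;
}
```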

I’d also try an unpatched, vanilla Linux kernel if you’re really curious.

No, it just shows the raw transfer bandwidth, without any mallocs. The system is pretty new and clean, but I will test it on a fresh one anyway…

@tmurray:
I have not used zero-copy yet, but I will test it in the next few days. I thought mapped memory had the same performance as pinned memory, but I would be happy to be mistaken.

For sizes like these, if I were you I would look at latency rather than bandwidth…
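To illustrate with a toy model (the overhead and bandwidth numbers below are assumptions, not measurements of your system):

```c
#include <stdio.h>

/* time(n) = overhead + n / bandwidth: for small copies the fixed
   per-call cost dominates the actual wire time */
int main(void)
{
    const double overhead_us = 10.0;    /* assumed per-copy setup cost */
    const double bandwidth_bs = 5e9;    /* assumed sustained rate: 5 GB/s */
    const double bytes = 6.0 * 1024.0;  /* a 6 kB chunk */

    double wire_us = bytes / bandwidth_bs * 1e6;  /* ~1.2 us */
    printf("fixed overhead: %.1f us, wire time: %.1f us\n",
           overhead_us, wire_us);
    /* effective bandwidth = bytes / (overhead + wire), far below peak */
    return 0;
}
```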

I have implemented mapped memory for my kernel, and the performance got even worse. That is not really a surprise, because a single memory copy is now split into several smaller copies (as many as there are thread blocks in the grid). The execution time doubled!

@Geka:
I do not really understand what you mean (maybe I’m just too dumb). I transfer about 6 kB of memory to the device over and over again, and the bandwidth directly affects the time needed. So what latency do you mean? RAM latency?

As you can see in the attachment, pinned memory with cudaMemcpyAsync() is always faster than pageable memory. This does not match the test in my first post (see that attachment). The only difference between the two tests is that the first one was made with CUDA 2.1 and the second with CUDA 2.2. I cannot reproduce the first result with CUDA 2.2, so now I am wondering whether there was a change in CUDA 2.2 that changed the performance.

Furthermore, two questions about streams came to mind:
Does anyone know how much overhead is created when using streams?
If I have two streams with kernel calls (as in the sketch below), is it deterministic which kernel gets executed first?
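For reference, this is the kind of two-stream setup I mean (simplified sketch; the kernels are placeholders):

```c
#include <cuda_runtime.h>

__global__ void kernel_a(float *p) { /* placeholder */ }
__global__ void kernel_b(float *p) { /* placeholder */ }

int main(void)
{
    const size_t bytes = 6 * 1024;
    float *h_a, *h_b, *d_a, *d_b;
    cudaStream_t s0, s1;

    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaMallocHost((void **)&h_a, bytes);  /* async copies need pinned memory */
    cudaMallocHost((void **)&h_b, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    /* within one stream the order is fixed (copy, then kernel);
       across the two streams the scheduling is up to the driver */
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    kernel_a<<<1, 256, 0, s0>>>(d_a);

    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
    kernel_b<<<1, 256, 0, s1>>>(d_b);

    cudaThreadSynchronize();  /* wait for both streams to finish */

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```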