Performance effects of pinned memory

Hi All,

I’m considering switching some of my code over to pinned memory. Does anyone have any experience with the performance effects of the switch?

In particular, I am running a kernel that is roughly similar to a matrix-matrix multiply. I can use standard memory and do the following:
(1) allocate memory on CPU
(2) copy memory to the GPU
(3) execute the kernel

or use mapped pinned memory like this:
(1) allocate memory (on the CPU) using cudaHostAlloc
(2) get the pointer to it (in device space) using cudaHostGetDevicePointer
(3) execute the kernel.

In the second option, the memory copy is done implicitly. I tried both options and found that the second one (using mapped pinned memory) was three to four times slower than the first (30.5 sec vs. 8.5 sec). Does this seem right?
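In rough outline, the two paths look something like this (a stripped-down sketch rather than my actual code; the kernel, sizes, and launch configuration are placeholders, and error checking is omitted):

```cpp
// Sketch only: placeholder kernel, no error checking.
#include <cuda_runtime.h>

__global__ void workKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // stand-in for the real matrix work
    if (i < n) out[i] = 2.0f * in[i];
}

// Option 1: ordinary host memory plus an explicit copy to device memory.
void runWithExplicitCopy(const float *hIn, float *hOut, int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);   // step (2): explicit copy
    workKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);    // step (3): kernel reads device memory
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}

// Option 2: mapped pinned memory; the kernel reads host memory directly.
void runWithMappedPinned(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *hIn, *hOut, *dIn, *dOut;
    cudaSetDeviceFlags(cudaDeviceMapHost);                 // must come before the context is created
    cudaHostAlloc((void**)&hIn,  bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dIn,  hIn,  0);      // device-space alias of hIn
    cudaHostGetDevicePointer((void**)&dOut, hOut, 0);
    workKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);    // every access goes over the bus
    cudaDeviceSynchronize();
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
}
```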

(I’m on a Tesla C2050).

Thanks

Additional info:

Both matrices easily fit into device memory simultaneously. I expected some overhead from using pinned memory, but nothing this drastic.

Two problems I can think of:
1.) Mapped data is read more than once.
2.) Mapped data is not accessed in the optimal pattern for burst transfers (thread 0 reads a 128-byte-aligned address, and each thread of a warp reads a successive dword/float).
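To make 2.) concrete, here is a hypothetical sketch (not from your code, just to illustrate the pattern the hardware likes when reading through a mapped pointer):

```cpp
// Friendly pattern: thread 0 of each warp starts at a 128-byte-aligned address
// and the 32 threads of the warp read 32 consecutive floats, so the warp's
// reads can be combined into a single 128-byte burst over the bus.
__global__ void coalescedRead(const float *mappedIn, float *devOut, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads, consecutive floats
    if (i < n)
        devOut[i] = mappedIn[i];
}

// Unfriendly pattern: a stride of 32 floats (128 bytes) makes each thread of a
// warp hit a different 128-byte segment, so every float costs its own burst.
__global__ void stridedRead(const float *mappedIn, float *devOut, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = ((long long)i * 32) % n;           // scattered accesses
    if (i < n)
        devOut[i] = mappedIn[j];
}
```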

If you have many matrices to process, it might be easiest to pipeline them around each kernel launch: copy the data for matrix N+1 from the mapped pointer to device memory, process matrix N in device memory, and store the result for matrix N-1 from device memory back to the mapped pointer.

I suspect you are correct. I was hoping that (1) wouldn’t occur because of some caching policy, but I can’t find anything about how that behaves (i.e., when does CUDA kick something out of device memory?). (2) is interesting… I hadn’t considered that the memory read pattern should be different when working with pinned memory. Is this documented somewhere?

Unfortunately, I’m in a situation where there are only a couple of matrices, but at least one won’t fit into device memory. I was hoping pinned memory would let me avoid some of the complexities of memory management…

thanks!

(1) I don’t think any access to host memory is cached: the GPU never knows whether the CPU modified the pinned memory, so it has to retransfer it on every access.

(2) Not sure how flexible mapped memory accesses are, but I can’t think of a more GPU-friendly access pattern than the coalescing pattern of the old G80 GPUs.

No hard sources on this, just my educated guess at what is causing your performance drop.

Is it possible to split the matrix calculation kernel into parts (groups of 32 rows/columns, for example)? Then apply the streaming pattern I described (load group N, calc group N-1, store group N-2).
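Roughly what I have in mind, as an untested sketch (placeholder kernel, made-up chunk size of 32 rows, pinned but unmapped host buffers, no error checking):

```cpp
// Rough, untested sketch of the chunked streaming idea: split the big matrix
// into groups of ROWS_PER_CHUNK rows and alternate between two streams so the
// upload of one group can overlap the kernel working on the previous one.
#include <cuda_runtime.h>

#define ROWS_PER_CHUNK 32

__global__ void processChunk(const float *in, float *out, int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // stand-in for the real per-group work
    if (i < rows * cols) out[i] = 2.0f * in[i];
}

// pinnedMat / pinnedOut are assumed to be allocated with cudaHostAlloc
// (cudaHostAllocDefault is enough, no mapping needed), so cudaMemcpyAsync
// can genuinely overlap with kernel execution.
void processLargeMatrix(const float *pinnedMat, float *pinnedOut, int rows, int cols)
{
    size_t chunkElems = (size_t)ROWS_PER_CHUNK * cols;
    size_t chunkBytes = chunkElems * sizeof(float);
    int numChunks = rows / ROWS_PER_CHUNK;           // assume rows divides evenly

    float *dIn[2], *dOut[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void**)&dIn[b],  chunkBytes);
        cudaMalloc((void**)&dOut[b], chunkBytes);
        cudaStreamCreate(&stream[b]);
    }

    for (int c = 0; c < numChunks; ++c) {
        int b = c & 1;                               // ping-pong between buffers/streams;
                                                     // stream ordering keeps buffer reuse safe
        cudaMemcpyAsync(dIn[b], pinnedMat + c * chunkElems, chunkBytes,
                        cudaMemcpyHostToDevice, stream[b]);
        processChunk<<<(int)((chunkElems + 255) / 256), 256, 0, stream[b]>>>(
            dIn[b], dOut[b], ROWS_PER_CHUNK, cols);
        cudaMemcpyAsync(pinnedOut + c * chunkElems, dOut[b], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) {
        cudaFree(dIn[b]);
        cudaFree(dOut[b]);
        cudaStreamDestroy(stream[b]);
    }
}
```

Since operations issued to the same stream run in order, reusing the two device buffers is safe without extra synchronisation, and the upload of group N can run while the kernel for group N-1 executes.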

Otherwise I’m out of ideas; I would stick to explicit memcpy and try CUDA streams for concurrent uploads.

Yeah, that sounds right to me. Then that is almost certainly the source of the slow-down.

I see. I originally optimized the code for a G80, so the memory access pattern is probably not the problem here…

Yeah, I think that’s the best approach, but it’s quite a bit of work to get it all implemented.