I’m considering switching some of my code over to pinned memory. Does anyone have any experience with the performance effects of the switch?
In particular, I am running a kernel that is roughly similar to a matrix-matrix multiply. I can use standard memory and do the following:
(1) allocate memory on CPU
(2) copy memory to the GPU
(3) execute the kernel
or use pinned memory like:
(1) allocate memory (on the CPU) using cudaHostAlloc
(2) get the pointer to it (in device space) using cudaHostGetDevicePointer
(3) execute the kernel.
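For concreteness, here is roughly what I mean by the two paths. This is a minimal sketch with a toy stand-in kernel, not my actual matrix code, and error checking is omitted:

```cpp
// Minimal sketch of both paths; "scale" and the sizes are illustrative only.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *a, int n)          // stand-in for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main()
{
    // Must be set before the CUDA context is created for mapped memory to work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Option 1: pageable host memory, explicit copies to and from device memory.
    float *h_a = (float*)malloc(bytes);
    float *d_a;
    cudaMalloc((void**)&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    // Option 2: mapped pinned memory; the kernel dereferences a device-space
    // pointer that actually points at host memory, so every access crosses PCIe.
    float *h_b, *d_b;
    cudaHostAlloc((void**)&h_b, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_b, h_b, 0);
    scale<<<(n + 255) / 256, 256>>>(d_b, n);
    cudaDeviceSynchronize();                    // results land directly in h_b

    cudaFree(d_a); free(h_a); cudaFreeHost(h_b);
    return 0;
}
```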
In the second option, the memory copy is done implicitly. I tried both of these options and found that the second one (using pinned memory) was three to four times as slow as the first (30.5 sec vs. 8.5 sec). Does this seem right?
Both of the matrices easily fit into device memory (simultaneously), so I was expecting some overhead from using pinned memory, but nothing this drastic.
Two problems I can think of:
1.) Mapped data is read more than once.
2.) Mapped data is not accessed in the optimal pattern for burst transfers (thread 0 reads a 128-byte-aligned address and each thread of a warp reads a successive dword/float; see the sketch below).
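By the optimal pattern I mean roughly this; it's only a sketch, not your kernel:

```cpp
// Coalesced pattern: thread 0 of each warp starts at a 128-byte-aligned
// address and consecutive threads read consecutive floats, so the warp's
// 32 x 4-byte reads collapse into a single 128-byte burst.
__global__ void coalesced_copy(const float *mapped_in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive addresses
    if (i < n)
        out[i] = mapped_in[i];                      // one 128-byte transaction per warp

    // Strided or column-wise access such as mapped_in[i * n] splits each warp's
    // reads into many separate transfers, which hurts far more when every read
    // has to cross PCIe instead of hitting device memory.
}
```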
If you have many matrices to process, it might be easiest to let each kernel launch do all three steps at once: copy the data for matrix N+1 from the mapped pointer into device memory, process matrix N in device memory, and store the result of matrix N-1 from device memory back to the mapped pointer.
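One way to read that suggestion, with a toy element-wise "process" step standing in for the real math (a real matrix multiply would need the staging and compute separated into phases, but the data movement is the point):

```cpp
// Each launch works on three matrices at once: stage matrix N+1 from the
// mapped host pointer into device memory, "process" matrix N that is already
// resident, and write the finished result of matrix N-1 back through the
// mapped pointer. The mapped pointer is touched only once per element.
__global__ void pipelined_step(const float *mapped_next,  // host data for N+1 (mapped)
                               float       *d_next,       // device buffer for N+1
                               float       *d_cur,        // matrix N, already on device
                               const float *d_prev,       // finished matrix N-1
                               float       *mapped_out,   // host result buffer (mapped)
                               int          n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_next[i]     = mapped_next[i];   // coalesced read over PCIe, once
        d_cur[i]     *= 2.0f;             // placeholder for the real work on N
        mapped_out[i] = d_prev[i];        // coalesced write back over PCIe, once
    }
}
```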
I suspect you are correct. I was hoping that (1) wouldn’t occur because of some caching policy, but I can’t find anything about how that behaves (i.e., when does CUDA evict something from device memory?). (2) is interesting… I hadn’t considered that the memory read pattern would matter when working with pinned memory. Is this documented somewhere?
Unfortunately, I’m in a situation where there are only a couple of matrices, but at least one of them won’t fit into device memory. I was hoping to avoid some of the complexities of memory management by using pinned memory…
(1) I don’t think any access to host memory is cached: the GPU never knows whether the CPU modified the pinned memory, so it has to retransfer it on every access.
(2) Not sure how flexible mapped memory accesses are. But I can’t think of a more GPU-friendly access pattern than the coalescing pattern of the old G80 GPUs.
No hard sources on this, just my educated guess as to what is causing your performance drop.
Is it possible to split the matrix calculation kernel into parts (groups of 32 rows/columns, for example)? Then you could apply the streaming pattern I described (load group N, calc group N-1, store group N-2).
Otherwise I’m out of ideas; I would stick to memcpy then and try CUDA streams for concurrent uploads.
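Roughly what I have in mind, as a sketch only: split the data into independent groups and double-buffer them across two streams so the upload of one group overlaps the processing of the previous one. The group size, the two-buffer scheme, and the process_group kernel are all placeholders, and this assumes the groups really can be processed independently:

```cpp
// Streamed pattern: while group g is being processed, group g+1 is uploaded
// and the result of an earlier group is downloaded, all via cudaMemcpyAsync
// on pinned (but NOT mapped) host memory.
#include <cuda_runtime.h>

#define NUM_GROUPS  16
#define GROUP_ELEMS (1 << 20)

__global__ void process_group(float *chunk, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

int main()
{
    const size_t group_bytes = GROUP_ELEMS * sizeof(float);

    // Pinned host buffer so cudaMemcpyAsync can overlap with kernel execution.
    float *h_data;
    cudaHostAlloc((void**)&h_data, NUM_GROUPS * group_bytes, cudaHostAllocDefault);

    // Double buffer on the device: one chunk computes while the other transfers.
    float *d_buf[2];
    cudaMalloc((void**)&d_buf[0], group_bytes);
    cudaMalloc((void**)&d_buf[1], group_bytes);

    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int g = 0; g < NUM_GROUPS; ++g) {
        int s = g & 1;                                // alternate buffers/streams
        float *h_chunk = h_data + (size_t)g * GROUP_ELEMS;

        // Upload group g, process it, and download the result, all queued in
        // stream s; the other stream is meanwhile still working on group g-1.
        cudaMemcpyAsync(d_buf[s], h_chunk, group_bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        process_group<<<(GROUP_ELEMS + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], GROUP_ELEMS);
        cudaMemcpyAsync(h_chunk, d_buf[s], group_bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(stream[0]); cudaStreamDestroy(stream[1]);
    cudaFree(d_buf[0]); cudaFree(d_buf[1]); cudaFreeHost(h_data);
    return 0;
}
```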