I’m considering switching some of my code over to pinned memory. Does anyone have any experience with the performance effects of the switch?
In particular, I am running a kernel that is roughly similar to a matrix-matrix multiply. I can use standard memory and do the following:
(1) allocate memory on CPU
(2) copy memory to the GPU
(3) execute the kernel
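To be concrete, the first path looks roughly like this; the kernel name, sizes, and launch configuration below are made up for illustration, not my actual code:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void myKernel(const float *a, float *out, int n);  // placeholder kernel

void runWithExplicitCopy(int n) {
    size_t bytes = n * sizeof(float);

    // (1) allocate (pageable) memory on the CPU
    float *h_a = (float *)malloc(bytes);

    // (2) copy it to the GPU
    float *d_a, *d_out;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    // (3) execute the kernel
    myKernel<<<(n + 255) / 256, 256>>>(d_a, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    cudaFree(d_out);
    free(h_a);
}
```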
or use mapped pinned ("zero-copy") memory like:
(1) allocate pinned memory on the CPU using cudaHostAlloc with the cudaHostAllocMapped flag
(2) get a pointer to it in device address space using cudaHostGetDevicePointer
(3) execute the kernel.
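And the second path, again as a rough sketch with placeholder names; note that as far as I understand, cudaHostGetDevicePointer requires the allocation to have the mapped flag, and mapping must be enabled before the CUDA context is created:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(const float *a, float *out, int n);  // placeholder kernel

void runWithMappedMemory(int n) {
    size_t bytes = n * sizeof(float);

    // Must be set before the first call that creates a CUDA context
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // (1) allocate pinned host memory with the mapped flag
    float *h_a, *h_out;
    cudaHostAlloc(&h_a, bytes, cudaHostAllocMapped);
    cudaHostAlloc(&h_out, bytes, cudaHostAllocMapped);

    // (2) get device-space pointers to the same allocations
    float *d_a, *d_out;
    cudaHostGetDevicePointer(&d_a, h_a, 0);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    // (3) execute the kernel; accesses go to host memory over the bus
    myKernel<<<(n + 255) / 256, 256>>>(d_a, d_out, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_a);
    cudaFreeHost(h_out);
}
```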
In the second option there is no explicit copy: the kernel reads and writes the host memory directly, so the transfer happens implicitly as the kernel runs. I tried both options and found that the second one (mapped pinned memory) was three to four times slower than the first (30.5 sec vs 8.5 sec). Does this seem right?
(I’m on a Tesla C2050.)