What exactly is page-locked memory? I understand it gets higher bandwidth on the bus between the CPU and GPU, resides in the host’s DRAM, and can be portable across CUDA contexts. But why not just use GPU memory directly? I’d take GDDR3 any day, unless there are benefits to pinned memory on the host that I’m missing. So…
(1) What is it?
(2) Why use it?
(3) When should I NOT use it?
From what I gather, pinned memory is great if you are going to be copying data back and forth between the CPU and GPU quite often, but may not be that beneficial if you’re only doing a few transfers…
There is some confusion in this statement between page-locked memory and the zero-copy features of CUDA 2.2.
Page-locked memory by itself is only directly accessible to the host, but as the above post mentions, data can be copied between page-locked memory and the GPU without an intermediate staging copy by the driver. On most systems this roughly doubles the speed of GPU transfers to/from host memory compared with pageable memory; Core i7 is the exception. (Triple-channel Core i7 systems have so much host memory bandwidth that, amazingly, there is almost no speed difference between page-locked and pageable memory.)
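If you want to see the difference on your own machine, here is a minimal timing sketch (the 64 MB buffer size and the time_copy helper are just illustrative, not from the original posts). It copies the same amount of data to the GPU once from a pageable buffer and once from a page-locked one, timing each with CUDA events:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Time a single host-to-device copy with CUDA events. */
static float time_copy(void *dst, const void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t bytes = 64 << 20;      /* illustrative 64 MB buffer */
    void *pageable = malloc(bytes);     /* ordinary pageable memory  */
    void *pinned   = NULL;
    void *dev      = NULL;

    cudaMallocHost(&pinned, bytes);     /* page-locked host memory   */
    cudaMalloc(&dev, bytes);

    printf("pageable: %.2f ms\n", time_copy(dev, pageable, bytes));
    printf("pinned:   %.2f ms\n", time_copy(dev, pinned, bytes));

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}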
My general philosophy is to use page-locked host memory for anything I need to transfer to the GPU more than once. The only reasons not to use it are if you don’t have control over how the buffer is allocated (you have to use cudaMallocHost() to allocate page-locked memory) or if your host buffer needs to be a significant fraction of the total memory in your computer. Page-locked memory cannot be swapped to disk (hence the name), so using too much of it reduces the amount of memory the OS can devote to everything else.
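A corollary: since pinned allocations come straight out of physical memory, a large request can fail, so check the return code rather than assuming success. A quick sketch of a fallback to pageable memory (the 256 MB size is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 256 << 20;   /* illustrative 256 MB request */
    void *buf = NULL;

    /* A big pinned allocation can fail on a memory-constrained host. */
    cudaError_t err = cudaMallocHost(&buf, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost: %s\n", cudaGetErrorString(err));
        buf = malloc(bytes);          /* pageable fallback            */
        /* ... use buf; transfers take the slower staged path ...     */
        free(buf);
        return 0;
    }

    /* ... use buf; transfers take the fast DMA path ... */
    cudaFreeHost(buf);
    return 0;
}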
Zero-copy is an entirely new and different feature in CUDA 2.2. It lets you map a page-locked host buffer directly into the GPU’s address space, so a running kernel can read and write host memory over the bus without any explicit cudaMemcpy(). Whether zero-copy is a win depends much more on the specifics of your algorithm.
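Here is a minimal zero-copy sketch, assuming a CUDA 2.2+ runtime and a device that reports canMapHostMemory; the scale kernel and the buffer size are placeholders of mine, not anything from the original posts:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   /* reads and writes host memory over the bus */
}

int main(void)
{
    const int n = 1 << 20;
    float *host = NULL, *dev = NULL;

    /* Must be set before the CUDA context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Page-locked AND mapped into the GPU's address space. */
    cudaHostAlloc((void **)&host, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i)
        host[i] = 1.0f;

    /* Device-side pointer that aliases the host buffer. */
    cudaHostGetDevicePointer((void **)&dev, host, 0);

    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaThreadSynchronize();          /* CUDA 2.2-era sync call */

    printf("host[0] = %f\n", host[0]);   /* 2.0, no cudaMemcpy needed */
    cudaFreeHost(host);
    return 0;
}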
Also, page-locked memory is, by default, not portable between CUDA contexts. You can use it from multiple host threads (each with its own context), but it will only be treated as page-locked in the context that created it. In the other contexts it is treated as pageable memory, which triggers an unnecessary staging copy by the driver on GPU transfers. CUDA 2.2 fixes this too: an allocation flag lets you mark page-locked memory as shared between all OS threads.
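With the CUDA 2.2 API that means passing cudaHostAllocPortable to cudaHostAlloc(); a sketch (the size is illustrative):

#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 16 << 20;    /* illustrative size */
    float *buf = NULL;

    /* cudaHostAllocPortable marks the buffer as page-locked for every
       CUDA context, so any host thread gets the fast transfer path
       instead of the driver's extra staging copy. */
    cudaHostAlloc((void **)&buf, bytes, cudaHostAllocPortable);

    /* ... share buf across host threads / contexts ... */

    cudaFreeHost(buf);
    return 0;
}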