Overlapping CPU <-> GPU transfer and kernel computation only works for pinned memory

Is there a technical reason why overlapping CPU <-> GPU transfers with kernel computation only works for buffers that were allocated as ‘pinned’ (non-pageable) CPU memory? The problem is that I often have buffers that were allocated by some third-party library as normal pageable memory.

I would also like to know the reason. If they can do a normal memory copy from device to host without non-pageable memory, why can’t they do an overlapping copy as well? Is it a limitation of the PCI bus?

Pageable memcpys are staged through a pinned buffer using CPU-side memcpys, whereas pinned memcpys are performed purely via DMA. So in theory we could do it, if we had a background thread doing the CPU-side memcpys and synchronizing with the GPU, etc.
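
For reference, a minimal sketch of the usual overlap pattern with pinned host memory, cudaMemcpyAsync, and multiple streams. The chunk count, sizes, and the `scale` kernel are made up for illustration; the point is only that the host buffer comes from cudaMallocHost, so the async copies can be real DMAs that overlap with kernels in other streams:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel; stands in for whatever work you want to overlap with the copies.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int N = 1 << 20;              // elements per chunk (arbitrary)
    const int CHUNKS = 4;
    const size_t bytes = N * sizeof(float);

    float *h_buf;                       // pinned host buffer -> copies can be true async DMAs
    cudaMallocHost(&h_buf, CHUNKS * bytes);

    float *d_buf;
    cudaMalloc(&d_buf, CHUNKS * bytes);

    cudaStream_t streams[CHUNKS];
    for (int c = 0; c < CHUNKS; ++c)
        cudaStreamCreate(&streams[c]);

    for (int c = 0; c < CHUNKS; ++c) {
        float *h = h_buf + c * N;
        float *d = d_buf + c * N;
        // H2D copy, kernel, and D2H copy of chunk c overlap with the other chunks' work,
        // because each chunk runs in its own stream and the host memory is pinned.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[c]);
        scale<<<(N + 255) / 256, 256, 0, streams[c]>>>(d, N, 2.0f);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[c]);
    }

    cudaDeviceSynchronize();

    for (int c = 0; c < CHUNKS; ++c)
        cudaStreamDestroy(streams[c]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```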

(Really, you should try using cudaHostRegister in 4.0; this is why it’s there.)
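
A rough sketch of how cudaHostRegister could be used to page-lock a buffer you didn’t allocate yourself. The malloc here just stands in for the third-party allocation; be aware that some older toolkits require the registered pointer and size to be page-aligned:

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main()
{
    const size_t bytes = 1 << 24;

    // 'ext_buf' stands in for pageable memory handed out by a third-party library.
    float *ext_buf = (float *)malloc(bytes);

    // Page-lock the existing allocation so cudaMemcpyAsync can DMA directly from it.
    // (Older CUDA versions may require ptr/size to be page-aligned.)
    cudaHostRegister(ext_buf, bytes, cudaHostRegisterDefault);

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // This copy can now overlap with kernels running in other streams.
    cudaMemcpyAsync(d_buf, ext_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Unpin before the buffer is freed by whoever owns it.
    cudaHostUnregister(ext_buf);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    free(ext_buf);
    return 0;
}
```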

@tmurray -> Thanks for the tip. It seems that ‘cudaHostRegister’ is exactly what I need. Nice …
http://developer.download.nvidia.com/compute/cuda/4_0/CUDA_Toolkit_4.0_Overview.pdf (page 6).