Why are transfers faster with cudaMallocHost, even after I page-lock "regular" memory?


I tried page-locking a buffer with VirtualLock, but it made no difference at all to CPU-to-GPU transfer speeds.

However, memory allocated by cudaMallocHost is able to transfer to the GPU at about 2X the speed of “regular” memory! Why is this happening? (Possibly alignment issues?)

Do you guys have ideas on this?


From what I’ve read on this forum it works like this:

cudaMallocHost allocates page-locked memory AND registers that memory with the CUDA driver. So when a transfer is initiated, the GPU can DMA directly to or from that memory, which is a lot faster.
With VirtualLock the memory is locked but not registered with CUDA, so the driver doesn’t know it is safe to DMA from. CUDA treats it like any regularly allocated memory and stages the copy through its own internal page-locked buffer first.
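To make the difference concrete, here is a rough sketch that times the same host-to-device copy from three kinds of host memory: a plain malloc buffer, a cudaMallocHost buffer, and the malloc buffer after registering it with the driver. Note the assumptions: cudaHostRegister is not mentioned anywhere in this thread and only exists in CUDA 4.0 and later, the 64 MiB size is arbitrary, and the CHECK macro and timeCopy helper are just names made up for this example. Treat it as a way to measure on your own machine, not as a definitive benchmark.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical helper: time one host-to-device copy, in milliseconds.
static float timeCopy(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary test size

    void* dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));

    // 1. Ordinary pageable memory: the driver knows nothing about it.
    void* pageable = malloc(bytes);
    if (pageable == nullptr) return EXIT_FAILURE;
    memset(pageable, 0, bytes);  // touch the pages so they are resident
    printf("pageable:         %.2f ms\n", timeCopy(dev, pageable, bytes));

    // 2. Memory allocated by CUDA: page-locked and known to the driver.
    void* pinned = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    memset(pinned, 0, bytes);
    printf("cudaMallocHost:   %.2f ms\n", timeCopy(dev, pinned, bytes));

    // 3. The same malloc buffer, now page-locked AND registered with CUDA.
    //    Assumption: CUDA 4.0+; very old versions also wanted page alignment.
    CHECK(cudaHostRegister(pageable, bytes, cudaHostRegisterDefault));
    printf("cudaHostRegister: %.2f ms\n", timeCopy(dev, pageable, bytes));
    CHECK(cudaHostUnregister(pageable));

    CHECK(cudaFreeHost(pinned));
    free(pageable);
    CHECK(cudaFree(dev));
    return 0;
}
```

If the registration explanation is right, cases 2 and 3 should come out close to each other and both ahead of case 1. Locking the buffer with VirtualLock instead of registering it would leave it in case 1, since the driver never learns that the pages are locked.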

Hope this vague explanation helps to shed some light on the matter.

I see.

I was hoping that just locking the memory before the transfer might be a way of speeding it up. A stupid idea in retrospect, because if that worked, CUDA would’ve done it for me! :)