I am a little puzzled about the reason why memory transfers are faster when using pinned memory.
As far as I understand, using pinned memory allows the use of DMA when transferring data from host to device memory, since DMA operates only on physical addresses, not on virtual addresses. With pageable memory (malloc), on the other hand, this is not the case, so ordinary move instructions are used for the transfer, which makes it slower. Is this right?
The GPU must always DMA from pinned memory. If you use malloc() for your host data, it sits in pageable (non-pinned) memory. When you call cudaMemcpy(), the CUDA driver first has to memcpy the data from your non-pinned pointer into an internal pinned staging buffer, and only then can it kick off the host->GPU DMA.
If you allocate your host memory with cudaMallocHost() and initialize the data there directly, the driver doesn't need that extra pageable-to-pinned memcpy before DMAing; it can DMA straight from your buffer.
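
For concreteness, here is a minimal sketch of the two paths side by side, timed with CUDA events. The 64 MB buffer size and the variable names are mine, and error checking is omitted for brevity:

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const size_t bytes = 64 * 1024 * 1024;        /* 64 MB test buffer */
        float *d_buf, *h_pageable, *h_pinned;

        cudaMalloc((void**)&d_buf, bytes);
        h_pageable = (float*)malloc(bytes);           /* pageable: driver must stage */
        cudaMallocHost((void**)&h_pinned, bytes);     /* page-locked: direct DMA */
        memset(h_pageable, 1, bytes);                 /* touch the pages */
        memset(h_pinned, 1, bytes);

        cudaEvent_t start, stop;
        float ms;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        /* Pageable path: the driver memcpys into an internal pinned
           staging buffer, then DMAs from there. */
        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("pageable H->D: %6.2f ms\n", ms);

        /* Pinned path: the GPU DMAs straight out of h_pinned. */
        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("pinned   H->D: %6.2f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(h_pinned);
        free(h_pageable);
        cudaFree(d_buf);
        return 0;
    }

On most systems the pinned copy reports noticeably higher bandwidth, precisely because the staging memcpy disappears.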
Some network (and disk) adapters support scatter/gather DMA, which lets the host hand the device a list of physical page addresses, so it can DMA to and from a buffer that is contiguous only in virtual memory (of course, this still won't work if the pages are swapped out to disk).
But since the G80 (afaik) doesn't support this, pinning really is necessary.
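
As an aside, and beyond what the G80-era toolkits offered: CUDA 4.0 and later can also page-lock an existing malloc'd buffer in place with cudaHostRegister(), so you get direct DMA without allocating through cudaMallocHost(). A sketch, assuming a CUDA 4.0+ runtime and reusing the names from the example above:

    float *h_data = (float*)malloc(bytes);
    /* Page-lock the existing allocation in place (CUDA 4.0+). */
    cudaHostRegister(h_data, bytes, cudaHostRegisterDefault);
    cudaMemcpy(d_buf, h_data, bytes, cudaMemcpyHostToDevice); /* now DMAs directly */
    cudaHostUnregister(h_data);
    free(h_data);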