Pinned memory, zero-copy: no-copy pinning of system memory

Hi again,

The CUDA 4.0 feature list mentions “No-copy pinning of system memory, a faster alternative to cudaMallocHost()”…
Is this any different from, or an improvement over, what previous versions offered?
I can barely find any information on it…


The improvement is that you can pin memory that you did not allocate yourself. It’s not faster than cudaMallocHost() on its own, but it is faster than cudaMallocHost() plus copying all the data into the newly allocated memory.
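
To make that concrete, here is a minimal sketch of pinning a buffer in place with cudaHostRegister() and then copying it to the GPU. The function name pin_existing_buffer_demo and the buffer contents are made up for illustration, and the plain malloc() stands in for "memory some other code handed you". If I remember correctly, CUDA 4.0 itself required the pointer and size to be page-aligned, so on that release you may need an aligned allocation instead; please check the documentation for your toolkit version.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: pins an ordinary malloc()'d buffer in place and
// copies it to the GPU. Returns 0 on success, 1 on any failure.
int pin_existing_buffer_demo(size_t n) {
    const size_t bytes = n * sizeof(float);

    // Stand-in for memory allocated by code we do not control.
    float* host = static_cast<float*>(malloc(bytes));
    if (host == NULL) return 1;
    for (size_t i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    int rc = 1;
    float* dev = NULL;

    // Page-lock the existing buffer; no second host allocation, no memcpy().
    if (cudaHostRegister(host, bytes, cudaHostRegisterDefault) == cudaSuccess) {
        if (cudaMalloc((void**)&dev, bytes) == cudaSuccess) {
            // The data still has to cross PCIe; pinning only removes the
            // extra CPU-side staging copy.
            if (cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess)
                rc = 0;
            cudaFree(dev);
        }
        // Unpin before handing the buffer back to the C allocator.
        cudaHostUnregister(host);
    }
    free(host);
    return rc;
}
```

Calling pin_existing_buffer_demo(1 << 20) from main() should return 0 on a machine with a working GPU. The alternative would be cudaMallocHost() for a second buffer followed by memcpy() from the original, which is the extra CPU copy this avoids.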

Hmm, that all sounds reasonable. Have you tried it? I have, and it apparently slowed my application down 4x. From what I have read, this technique only performs well on integrated GPUs. If the GPU is not integrated, it does not improve performance, and if memory accesses are not sequential, performance is seriously affected.

However, I have a feeling I might be missing something here. Has anybody else tried it?


“No-copy pinning of system memory” of course does not avoid copying the data to the GPU. It only avoids copying data on the CPU a second time if it happens to be in unpinned memory.

cudaHostRegister() should not slow down your application, though. malloc() + cudaHostRegister() should be faster than cudaHostAlloc() + memcpy().
However, I suspect you are comparing runtimes for zero-copy against transferring the data to GPU memory with cudaMemcpy() first. With zero-copy, the kernel processes the data directly from CPU memory without copying it to GPU memory beforehand, but except on integrated GPUs every access still has to go through PCIe. Zero-copy can therefore be slower, both because of the high latency of the PCIe bus and because the same data may be transferred over PCIe multiple times.
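
In case it helps to see the two paths I mean side by side, here is a rough sketch. The kernel double_all and the functions run_zero_copy and run_explicit_copy are invented names for illustration, not anything from your code. Both compute the same result, and each returns the number of wrong elements (or -1 on a CUDA error); timing them with your own access pattern would show which one wins on your hardware.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: out[i] = 2 * in[i].
__global__ void double_all(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

static int count_wrong(const float* in, const float* out, int n) {
    int wrong = 0;
    for (int i = 0; i < n; ++i) if (out[i] != 2.0f * in[i]) ++wrong;
    return wrong;
}

// Zero-copy: the kernel reads and writes mapped pinned host memory directly.
int run_zero_copy(int n) {
    // Older toolkits need this before the context is created; if a context
    // already exists the call may fail, so the error is cleared and ignored.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    const size_t bytes = n * sizeof(float);
    float *h_in = NULL, *h_out = NULL, *d_in = NULL, *d_out = NULL;
    bool ok = cudaHostAlloc((void**)&h_in, bytes, cudaHostAllocMapped) == cudaSuccess;
    ok = ok && cudaHostAlloc((void**)&h_out, bytes, cudaHostAllocMapped) == cudaSuccess;
    // Device-side aliases of the host buffers; no cudaMemcpy() anywhere.
    ok = ok && cudaHostGetDevicePointer((void**)&d_in, h_in, 0) == cudaSuccess;
    ok = ok && cudaHostGetDevicePointer((void**)&d_out, h_out, 0) == cudaSuccess;
    int result = -1;
    if (ok) {
        for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
        double_all<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
        if (ok) result = count_wrong(h_in, h_out, n);
    }
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return result;
}

// Explicit copy: pinned host buffers, one transfer each way over PCIe.
int run_explicit_copy(int n) {
    const size_t bytes = n * sizeof(float);
    float *h_in = NULL, *h_out = NULL, *d_in = NULL, *d_out = NULL;
    bool ok = cudaMallocHost((void**)&h_in, bytes) == cudaSuccess;
    ok = ok && cudaMallocHost((void**)&h_out, bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_in, bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_out, bytes) == cudaSuccess;
    int result = -1;
    if (ok) {
        for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
        ok = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
        if (ok) {
            double_all<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
            ok = cudaGetLastError() == cudaSuccess;
        }
        ok = ok && cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        if (ok) result = count_wrong(h_in, h_out, n);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return result;
}
```

This kernel touches each element exactly once and in order, which is the friendliest case for zero-copy. If your kernel reads the same input several times or with a scattered pattern, every one of those reads goes over PCIe in the first version, whereas in the second version they hit GPU memory after a single transfer.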