CPU operations are very slow on memory allocated by cudaMallocHost

Copying data between the GPU and CPU is faster when I use cudaMallocHost (rather than malloc) to allocate the host memory (let’s call it hostMem).

However, CPU operations on hostMem are much slower. Is there a way to allocate memory that makes copying faster but doesn’t slow down CPU operations?

I found in some other topics that pinned memory (allocated by cudaMallocHost) doesn’t use the CPU cache, which would explain why CPU operations on pinned memory are slow.

Is there a faster way to do CPU operations on pinned memory allocated by cudaMallocHost?


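The cudaMallocHost overhead may be because memory allocated with cudaHostAlloc() is marked “uncacheable”. That is documented behaviour for write-combined allocations (the cudaHostAllocWriteCombined flag), which are slow to read from the CPU; whether it also applies to a default cudaMallocHost allocation depends on the platform.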
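
If CPU access to the pinned buffer really is the bottleneck on your system, two workarounds might be worth trying. This is only a rough sketch, not a measured solution: the helper cpu_work, the buffer names and the size are made up for illustration, and cudaHostRegister may not be supported or may behave differently on some platforms. Option A does the CPU work in an ordinary malloc buffer and copies it into a pinned staging buffer just before the transfer. Option B pins the existing malloc buffer in place with cudaHostRegister.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical CPU-side work: write every element, then read it back.
static float cpu_work(float *buf, size_t n) {
    for (size_t i = 0; i < n; ++i) buf[i] = static_cast<float>(i % 7);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += buf[i];
    return sum;
}

int main() {
    const size_t n = 1 << 20;                 // assumed element count
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable host memory for the CPU-heavy part.
    float *work = static_cast<float *>(std::malloc(bytes));
    if (!work) return EXIT_FAILURE;

    float *staging = nullptr;                 // pinned, used only for transfers
    float *dev = nullptr;                     // device buffer
    CHECK(cudaMallocHost(reinterpret_cast<void **>(&staging), bytes));
    CHECK(cudaMalloc(reinterpret_cast<void **>(&dev), bytes));

    // Option A: compute in pageable memory, stage into pinned memory, copy.
    float sum = cpu_work(work, n);
    std::memcpy(staging, work, bytes);        // sequential write into pinned memory
    CHECK(cudaMemcpy(dev, staging, bytes, cudaMemcpyHostToDevice));

    // Option B: pin the existing malloc'd buffer, copy from it, unpin it.
    CHECK(cudaHostRegister(work, bytes, cudaHostRegisterDefault));
    CHECK(cudaMemcpy(dev, work, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaHostUnregister(work));

    std::printf("CPU-side sum: %f\n", sum);

    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(staging));
    std::free(work);
    return 0;
}
```

Option A costs an extra host-side memcpy, so it only pays off if the CPU work touches the data many times. Timing both variants against your current code on the target machine is the only way to know which one helps.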