CPU operation is very slow on memory allocated by cudaMallocHost

The original thread is posted here, but it may be more appropriate to post it in the TensorRT branch.

Copying data between the GPU and the CPU is faster when I use cudaMallocHost (rather than malloc) to allocate the host memory (call it hostMem).

However, CPU operations on hostMem are much slower. Is there a way to allocate memory that makes copying faster but doesn’t slow down CPU operations?

I found in some other topics that pinned memory (allocated by cudaMallocHost) doesn’t use the CPU cache, which is why CPU operations on pinned memory are slow.

Is there a faster way to do CPU operations on this pinned memory allocated by cudaMallocHost?


I think this is more suitable for discussion in “CUDA Programming and Performance”.

The cudaMallocHost overhead may be because memory allocated with cudaHostAlloc() is marked “uncacheable”.
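
If the pinned buffer really is uncached on your platform, one possible workaround is to keep the CPU-heavy work in ordinary malloc memory and use the pinned buffer only as a staging area for the transfers. Below is a minimal sketch of that idea, not a definitive fix. The names work, staging and dev and the buffer size are made up for illustration, and the extra memcpy has its own cost, so whether this wins overall has to be measured on the actual device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main() {
    const size_t n = 1 << 20;                // illustrative element count
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable memory: all CPU-heavy work happens here.
    float* work = static_cast<float*>(malloc(bytes));
    if (work == nullptr) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }

    // Pinned memory: touched by the CPU only for bulk copies.
    float* staging = nullptr;
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&staging), bytes));

    float* dev = nullptr;
    CHECK(cudaMalloc(reinterpret_cast<void**>(&dev), bytes));

    // Stand-in for the real CPU computation, done on pageable memory.
    for (size_t i = 0; i < n; ++i) {
        work[i] = 0.5f * static_cast<float>(i);
    }

    // One sequential copy into the pinned buffer, then the host-to-device transfer.
    memcpy(staging, work, bytes);
    CHECK(cudaMemcpy(dev, staging, bytes, cudaMemcpyHostToDevice));

    // (GPU kernels would run here.)

    // Device-to-host transfer into pinned memory, then back to pageable memory.
    CHECK(cudaMemcpy(staging, dev, bytes, cudaMemcpyDeviceToHost));
    memcpy(work, staging, bytes);

    // Sanity check: the data should survive the round trip unchanged.
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        if (work[i] != 0.5f * static_cast<float>(i)) {
            ok = false;
            break;
        }
    }
    printf(ok ? "round trip ok\n" : "round trip mismatch\n");

    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(staging));
    free(work);
    return ok ? 0 : 1;
}
```

The point of this layout is that the CPU only ever reads or writes the pinned buffer in a single sequential pass, while any repeated or random access goes to normal cached memory.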