CPU operation is very slow on memory allocated by cudaMallocHost

The original thread is posted at the link below, but the TensorRT board may be a more appropriate place for it.
https://devtalk.nvidia.com/default/topic/1042530/cpu-operation-is-very-slow-on-memory-allocated-by-cudamallochost-/#5288277

Copying data between the GPU and the CPU is faster when I allocate the host memory (call it hostMem) with cudaMallocHost rather than malloc.

However, CPU operations on hostMem are much slower. Is there a way to allocate memory that makes copying faster without slowing down CPU operations?

I found in some other topics that pinned memory (allocated by cudaMallocHost) does not use the CPU cache, which is why CPU operations on pinned memory are slow.

Is there a faster way to do CPU operations on pinned memory allocated by cudaMallocHost?
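
To put a number on the slowdown described above, here is a minimal timing sketch. It is an illustration, not code from the original post: the helper name time_cpu_pass, the buffer size, and the arithmetic in the loop are all assumptions chosen for the example. It runs the same CPU loop over a malloc buffer and a cudaMallocHost buffer and prints both timings; the size of the gap will depend on the platform.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: time one read-modify-write pass over n floats on the
// CPU and return the elapsed time in milliseconds.
static double time_cpu_pass(float* buf, size_t n) {
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i) {
        buf[i] = buf[i] * 2.0f + 1.0f;
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t n = 1 << 24;  // 16M floats (64 MiB); arbitrary test size
    const size_t bytes = n * sizeof(float);

    // Pageable host memory from malloc.
    float* pageable = static_cast<float*>(std::malloc(bytes));
    if (pageable == nullptr) {
        std::fprintf(stderr, "malloc failed\n");
        return 1;
    }

    // Pinned (page-locked) host memory from cudaMallocHost.
    float* pinned = nullptr;
    cudaError_t err = cudaMallocHost(reinterpret_cast<void**>(&pinned), bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        std::free(pageable);
        return 1;
    }

    // Touch every element first so page faults are not counted in the timing.
    for (size_t i = 0; i < n; ++i) {
        pageable[i] = 1.0f;
        pinned[i] = 1.0f;
    }

    const double ms_pageable = time_cpu_pass(pageable, n);
    const double ms_pinned = time_cpu_pass(pinned, n);

    // Printing one element of each buffer keeps the loops from being optimized away.
    std::printf("malloc:         %.2f ms (check %.1f)\n", ms_pageable, pageable[0]);
    std::printf("cudaMallocHost: %.2f ms (check %.1f)\n", ms_pinned, pinned[0]);

    cudaFreeHost(pinned);
    std::free(pageable);
    return 0;
}
```

If the two timings are close on a given machine, caching of pinned memory is probably not the bottleneck there.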

Hello,

I think this is better suited to the “CUDA Programming and Performance” board:
https://devtalk.nvidia.com/default/board/57/cuda-programming-and-performance/

The cudaMallocHost overhead may be due to memory allocated with cudaHostAlloc() being marked “uncacheable”.
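
If uncached pinned memory does turn out to be the cause, one possible workaround is to keep the CPU work in ordinary malloc memory and use the pinned buffer only as a staging area for the transfer. The sketch below assumes that pattern; cpu_preprocess is a hypothetical stand-in for the real CPU workload, and the buffer size is arbitrary. Whether the extra memcpy pays off depends on how often the CPU touches the data, so it is worth measuring.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical CPU-side step standing in for the real workload.
static void cpu_preprocess(float* buf, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        buf[i] = static_cast<float>(i) * 0.5f;
    }
}

int main() {
    const size_t n = 1 << 20;  // arbitrary example size
    const size_t bytes = n * sizeof(float);

    // Pageable buffer: all repeated CPU reads and writes happen here.
    float* work = static_cast<float*>(std::malloc(bytes));
    if (work == nullptr) {
        std::fprintf(stderr, "malloc failed\n");
        return 1;
    }

    // Pinned buffer: written once per transfer, never used for computation.
    float* staging = nullptr;
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&staging), bytes));

    float* dev = nullptr;
    CHECK(cudaMalloc(reinterpret_cast<void**>(&dev), bytes));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // 1. Do the CPU work in pageable memory.
    cpu_preprocess(work, n);

    // 2. Copy the result into the pinned buffer in one sequential pass.
    std::memcpy(staging, work, bytes);

    // 3. Transfer from pinned memory to the device.
    CHECK(cudaMemcpyAsync(dev, staging, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(staging));
    std::free(work);
    return 0;
}
```

Another option on platforms that support it is to page-lock an existing malloc buffer with cudaHostRegister() and release it with cudaHostUnregister(); whether that keeps the memory cached is platform-dependent and has not been checked for the setup discussed in this thread.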