cudaMallocHost allocates pinned (page-locked) memory on the host. Can this memory (pMem) be used by both the GPU and the CPU?
Or do I have to cudaMemcpy it to use it on the GPU?
pMem will be accessible by both the GPU and the CPU if you allocate it with cudaMallocManaged (CUDA Unified Memory). With cudaMallocHost you have to cudaMemcpy the data to access it from the GPU.
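Roughly, the managed route looks like the sketch below; the scale kernel and the size N are made up for illustration, but the allocation call is the real API:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *pMem = nullptr;
    cudaMallocManaged((void **)&pMem, N * sizeof(float)); // one pointer, visible to both sides
    for (int i = 0; i < N; ++i) pMem[i] = 1.0f;           // CPU writes directly
    scale<<<(N + 255) / 256, 256>>>(pMem, N);             // GPU uses the same pointer
    cudaDeviceSynchronize();                              // sync before the CPU touches it again
    printf("pMem[0] = %f\n", pMem[0]);
    cudaFree(pMem);                                       // managed memory is freed with cudaFree
    return 0;
}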
To be precise about "directly": it means the GPU can access the memory asynchronously with respect to the CPU. The device pointer is not necessarily the same as the host pointer (query it with cudaHostGetDevicePointer), and you still have to copy explicitly with cudaMemcpy or cudaMemcpyAsync unless the allocation is mapped into the device address space.
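A sketch of the mapped ("zero-copy") path, assuming a device that supports mapped pinned memory; consume is a made-up kernel, and on UVA systems the alias returned by cudaHostGetDevicePointer happens to equal the host pointer:

#include <cuda_runtime.h>

__global__ void consume(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int N = 1 << 16;
    float *hIn = nullptr, *dInAlias = nullptr, *dOut = nullptr;
    cudaHostAlloc((void **)&hIn, N * sizeof(float), cudaHostAllocMapped); // mapped pinned buffer
    cudaHostGetDevicePointer((void **)&dInAlias, hIn, 0);                 // may equal hIn under UVA
    cudaMalloc((void **)&dOut, N * sizeof(float));
    for (int i = 0; i < N; ++i) hIn[i] = (float)i;                        // CPU fills the buffer
    // The kernel reads host memory directly; no cudaMemcpy needed,
    // but every access crosses the PCIe bus.
    consume<<<(N + 255) / 256, 256>>>(dInAlias, dOut, N);
    cudaDeviceSynchronize();
    cudaFree(dOut);
    cudaFreeHost(hIn);
    return 0;
}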
While cudaMallocHost memory can be accessed by both the CPU and the GPU, at least in my latest project (Turing, compute capability 7.5, CUDA 11.0) the profiler shows it "living" in CPU/system memory, and there was a device-side performance impact in my case.
When I used it for an input buffer (the CPU writes once, then the GPU does sequential 32-bit reads), it was noticeably slower than allocating device memory and using cudaMemcpy to transfer from host to device. It is possible that more optimized read patterns would hide the impact. (Or the cost may simply shift into the cudaMemcpy rather than the kernel runtime, but that copy is highly optimized.)
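For reference, the explicit-copy pattern that came out faster for me looks roughly like this; the consume kernel and the exact sizes are illustrative, not my actual code:

#include <cuda_runtime.h>

__global__ void consume(const unsigned *in, unsigned *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] ^ 0xdeadbeefu;           // stand-in for real work
}

int main() {
    const int N = (256 * 1024) / sizeof(unsigned);     // the ~256K buffer mentioned above
    unsigned *hIn = nullptr, *dIn = nullptr, *dOut = nullptr;
    cudaMallocHost((void **)&hIn, N * sizeof(unsigned)); // pinned => fast DMA copy
    cudaMalloc((void **)&dIn, N * sizeof(unsigned));
    cudaMalloc((void **)&dOut, N * sizeof(unsigned));
    cudaStream_t s;
    cudaStreamCreate(&s);
    for (int i = 0; i < N; ++i) hIn[i] = (unsigned)i;  // CPU writes once
    // One bulk DMA transfer, then the kernel reads fast device memory
    // instead of paying a PCIe round trip per 32-bit read.
    cudaMemcpyAsync(dIn, hIn, N * sizeof(unsigned), cudaMemcpyHostToDevice, s);
    consume<<<(N + 255) / 256, 256, 0, s>>>(dIn, dOut, N);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(dOut);
    cudaFree(dIn);
    cudaFreeHost(hIn);
    return 0;
}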
I continue to use cudaMallocHost for output buffers (you cannot beat the convenience) and for very small input buffers (a few KB or less). I was surprised that even a 256 KB input buffer showed such an impact.
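By "convenience" for output buffers I mean something like the following sketch (produce is a made-up kernel, and this assumes the pinned buffer is mapped for device access): the GPU writes straight into pinned host memory, and after a sync the CPU reads the results with no copy-back step:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i * 0.5f;               // stand-in for real results
}

int main() {
    const int N = 1024;
    float *hOut = nullptr, *dOutAlias = nullptr;
    cudaHostAlloc((void **)&hOut, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dOutAlias, hOut, 0);
    produce<<<(N + 255) / 256, 256>>>(dOutAlias, N);   // GPU writes host memory directly
    cudaDeviceSynchronize();                           // results now visible to the CPU
    printf("hOut[10] = %f\n", hOut[10]);               // no cudaMemcpy back needed
    cudaFreeHost(hOut);
    return 0;
}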