Hello, recently I have worked on GPU for application acceleration. Now I am perplexed about the low performance for CudaMallocHost.
I have many buffers *As and *Bs, which are all around 100MB. For each buffer, I need to copy some data from A to B, and they may also be copied to GPU when the application needs, so these buffers are created with CudaMallocHost for high transfer bandwidth and asynchronous transfer.
But the copy rate from A to B is low. When I create the buffer with C++ “new” operation without other change, the copy rate can increase from 4GB/s to around 6GB/s. I am just confused that why CudaMallocHost will hurt the access performance by CPU. Is it just page-locked for no replacement by the operating system? Has anyone met the same problem? Thanks!
I think this is expected behavior and you can find other reports like this.
I don’t have a precise description, but memory pages allocated with cudaHostAlloc or cudaMallocHost don’t have exactly the same characteristics as host memory pages allocated with an ordinary system allocator. In particular I believe the CPU caching characteristics are different, and this makes some sense if the memory region is used for communication between GPU and CPU.
I’m not aware of anything that can be done about it, other than not using pinned memory. THere are non-default flags available for cudaHostAlloc, but as far as I know none of these make the situation better.
You could try to use the cudaHostAlloc() function instead, and test whether any of the optional flags could give you a speed benefit
cudaHostAllocDefault: This flag's value is defined to be 0 and causes cudaHostAlloc() to emulate cudaMallocHost().
cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
OK, thanks for Robert_Crovella and cbuchner1 kindly help! I will make a try for these suggestions.
BTW, is there any document or whitepaper that has detailed described these properties? The CUDA Programming Guide has just given the usage but not the deep reason. I would like to research these memories and explain the reason.