Low performance for CPU accessing page-locked memory?

foxspy · March 6, 2019, 9:59am

Hello, I have recently been working on GPU acceleration for an application, and I am puzzled by the low CPU-side performance of memory allocated with cudaMallocHost.

I have many buffers *As and *Bs, each around 100MB. For each pair, I need to copy some data from A to B on the CPU, and the buffers may also be copied to the GPU when the application needs them, so I create them with cudaMallocHost to get high transfer bandwidth and asynchronous transfers.

But the copy rate from A to B is low. When I create the buffers with the C++ “new” operator and change nothing else, the copy rate increases from 4GB/s to around 6GB/s. I am confused about why cudaMallocHost hurts CPU access performance. Isn't the memory simply page-locked so that the operating system cannot swap it out? Has anyone met the same problem? Thanks!
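For anyone who wants to reproduce the comparison, here is a minimal sketch of the kind of test described above. The buffer size, repetition count, and the helper name `copy_rate_gbps` are assumptions for illustration, not the original application code, and the numbers will vary by system.

```cuda
// Sketch: compare CPU memcpy throughput for pageable vs. pinned host buffers.
// Build (assumed): nvcc -O2 -o hostcopy hostcopy.cu
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstring>

// Hypothetical helper: time `reps` CPU copies of `bytes` and return GB/s.
static double copy_rate_gbps(void* dst, const void* src, size_t bytes, int reps) {
    std::memcpy(dst, src, bytes);  // warm-up so page faults are not timed
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) std::memcpy(dst, src, bytes);
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(bytes) * reps / sec / 1e9;
}

int main() {
    const size_t bytes = size_t(100) << 20;  // ~100MB, as in the post
    const int reps = 20;                     // assumed repetition count

    // Pageable buffers from the ordinary C++ allocator.
    char* a = new char[bytes];
    char* b = new char[bytes];
    std::memset(a, 1, bytes);
    std::memset(b, 0, bytes);
    double pageable = copy_rate_gbps(b, a, bytes, reps);
    // Printing a copied byte keeps the compiler from discarding the copies.
    std::printf("pageable (new):          %.2f GB/s (check byte %d)\n",
                pageable, b[bytes - 1]);
    delete[] a;
    delete[] b;

    // Page-locked buffers from cudaMallocHost.
    char* pa = nullptr;
    char* pb = nullptr;
    cudaError_t err = cudaMallocHost(reinterpret_cast<void**>(&pa), bytes);
    if (err == cudaSuccess)
        err = cudaMallocHost(reinterpret_cast<void**>(&pb), bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        if (pa) cudaFreeHost(pa);
        return 1;
    }
    std::memset(pa, 1, bytes);
    std::memset(pb, 0, bytes);
    double pinned = copy_rate_gbps(pb, pa, bytes, reps);
    std::printf("pinned (cudaMallocHost): %.2f GB/s (check byte %d)\n",
                pinned, pb[bytes - 1]);
    cudaFreeHost(pa);
    cudaFreeHost(pb);
    return 0;
}
```

Both cases use the same copy routine, so any difference in the reported rate should come from how the memory was allocated rather than from the copy itself.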

Robert_Crovella · March 6, 2019, 1:58pm

I think this is expected behavior and you can find other reports like this.

I don’t have a precise description, but memory pages allocated with cudaHostAlloc or cudaMallocHost don’t have exactly the same characteristics as host memory pages allocated with an ordinary system allocator. In particular I believe the CPU caching characteristics are different, and this makes some sense if the memory region is used for communication between GPU and CPU.

I’m not aware of anything that can be done about it, other than not using pinned memory. There are non-default flags available for cudaHostAlloc, but as far as I know none of them make the situation better.
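One way to read "not using pinned memory" is to keep the CPU-heavy buffers pageable and pin only a staging buffer for the transfer. The following is a rough sketch under that assumption; the names `staging` and `dev` are hypothetical, and the extra host-side copy may cancel the gain, so it would need to be measured on the target system.

```cuda
// Sketch: CPU-side A->B copy in pageable memory, with one pinned staging
// buffer used only for the asynchronous host-to-device transfer.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <vector>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = size_t(100) << 20;       // ~100MB, as in the thread

    std::vector<char> a(bytes, 1), b(bytes, 0);   // pageable working buffers

    char* staging = nullptr;                      // pinned, transfer-only (assumed design)
    void* dev = nullptr;
    cudaStream_t stream;
    CHECK(cudaMallocHost(reinterpret_cast<void**>(&staging), bytes));
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaStreamCreate(&stream));

    // CPU-side work stays in ordinary pageable memory.
    std::memcpy(b.data(), a.data(), bytes);

    // Only when the GPU needs the data: one extra copy into the pinned
    // staging buffer, then an asynchronous transfer on the stream.
    std::memcpy(staging, b.data(), bytes);
    CHECK(cudaMemcpyAsync(dev, staging, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));  // staging must not be reused before this

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(staging));
    std::printf("transfer complete\n");
    return 0;
}
```

This trades one additional CPU copy per transfer for faster CPU access to A and B, so it only makes sense if the buffers are touched by the CPU much more often than they are sent to the GPU.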

cbuchner1 · March 6, 2019, 3:22pm

You could try using the cudaHostAlloc() function instead, and test whether any of the optional flags gives you a speed benefit:

    cudaHostAllocDefault: This flag's value is defined to be 0 and causes cudaHostAlloc() to emulate cudaMallocHost().
    cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
    cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
    cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
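A quick way to test the flags is to allocate a buffer pair with each one and time the same CPU copy. This is only a sketch: the `FlagCase` struct, buffer size, and repetition count are assumptions for illustration, and the source buffer is read by the CPU, so the write-combined case is expected to be slow.

```cuda
// Sketch: time a CPU memcpy between two buffers allocated with each
// cudaHostAlloc flag listed above.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstring>

struct FlagCase {            // hypothetical helper type for this test
    const char* name;
    unsigned int flags;
};

int main() {
    const size_t bytes = size_t(100) << 20;  // ~100MB, as in the thread
    const int reps = 5;                      // kept small: WC reads can be very slow

    // Note: on some older setups, mapped allocations may also require
    // cudaSetDeviceFlags(cudaDeviceMapHost) before the context is created.
    const FlagCase cases[] = {
        {"cudaHostAllocDefault",       cudaHostAllocDefault},
        {"cudaHostAllocPortable",      cudaHostAllocPortable},
        {"cudaHostAllocMapped",        cudaHostAllocMapped},
        {"cudaHostAllocWriteCombined", cudaHostAllocWriteCombined},
    };

    for (const FlagCase& c : cases) {
        char* src = nullptr;
        char* dst = nullptr;
        cudaError_t err = cudaHostAlloc(reinterpret_cast<void**>(&src), bytes, c.flags);
        if (err == cudaSuccess)
            err = cudaHostAlloc(reinterpret_cast<void**>(&dst), bytes, c.flags);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "%s: allocation failed: %s\n",
                         c.name, cudaGetErrorString(err));
            if (src) cudaFreeHost(src);
            continue;
        }

        std::memset(src, 1, bytes);
        std::memcpy(dst, src, bytes);  // warm-up copy, not timed

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i) std::memcpy(dst, src, bytes);
        auto t1 = std::chrono::steady_clock::now();

        double sec = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%-28s %.2f GB/s\n", c.name,
                    static_cast<double>(bytes) * reps / sec / 1e9);

        cudaFreeHost(src);
        cudaFreeHost(dst);
    }
    return 0;
}
```

Comparing these rates against the same copy on buffers from “new” would show whether any flag closes the gap on a given machine.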

foxspy · March 7, 2019, 12:50am

OK, thanks to Robert_Crovella and cbuchner1 for the kind help! I will give these suggestions a try.

BTW, is there any document or whitepaper that describes these properties in detail? The CUDA Programming Guide only shows the usage, not the underlying reason. I would like to study these memory types and understand the cause.
