cudaHostAllocMapped

Hey!

Can anybody tell me what the difference is between

cudaHostAlloc (void **ptr, size_t size, cudaHostAllocMapped)

and

cudaHostAlloc (void **ptr, size_t size, cudaHostAllocDefault)?

I don’t understand the advantage of mapped page-locked memory compared to normal page-locked memory. In both cases the device can access the host memory directly. What advantage does the address mapping bring? And which method is faster?

best regards!


Mapped memory can be accessed directly from within a kernel without needing to cudaMemcpy the region back and forth. You need to get a different pointer to do this on the device, though, through a CUDA API call that I don’t remember the name of at the moment.
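The call being referred to is cudaHostGetDevicePointer. A minimal sketch of mapped memory in use (error checking omitted; on older devices you may also need cudaSetDeviceFlags(cudaDeviceMapHost) before the first allocation):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;   // reads and writes go over the bus to host memory
}

int main(void)
{
    const int n = 256;
    int *h_ptr;   // host-side pointer
    int *d_ptr;   // device-side alias of the same allocation

    // Allocate mapped, page-locked host memory.
    cudaHostAlloc((void **)&h_ptr, n * sizeof(int), cudaHostAllocMapped);

    for (int i = 0; i < n; ++i)
        h_ptr[i] = i;

    // Obtain the device pointer for the same memory.
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    // The kernel works on host memory directly -- no cudaMemcpy needed.
    increment<<<(n + 127) / 128, 128>>>(d_ptr, n);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %d\n", h_ptr[0]);

    cudaFreeHost(h_ptr);
    return 0;
}
```

Note that every kernel access to mapped memory travels across PCIe, so it is usually only a win when the data is touched once or twice.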

Thanks for the reply!

Ok, so cudaHostAlloc (…, cudaHostAllocDefault) works in the same way as cudaMalloc(…), with the difference that cudaHostAlloc (…, cudaHostAllocDefault) uses no paging. Right? But why is using page-locked memory faster?

Page-locked memory is faster because the GPU can only DMA to page-locked memory (if it’s pageable, then the pages might get swapped out mid-transfer causing Bad Things to happen). This means that cudaMemcpy internally does a copy to a page-locked buffer, and then has the card DMA to that. This extra host-side copy slows things down. If you use cudaMallocHost or cudaHostAlloc, then the driver knows that the buffer is pinned and can skip the copy to the internal buffer.
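The two transfer paths described above can be sketched like this (a hypothetical comparison with an arbitrary buffer size; error checking and timing omitted):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t size = 64 << 20;   // 64 MiB, an arbitrary example size
    float *d_buf;
    cudaMalloc((void **)&d_buf, size);

    // Pageable host memory: cudaMemcpy must first copy the data into the
    // driver's internal page-locked staging buffer, then DMA from there.
    float *pageable = (float *)malloc(size);
    cudaMemcpy(d_buf, pageable, size, cudaMemcpyHostToDevice);

    // Page-locked host memory: the GPU DMAs directly from this buffer,
    // skipping the extra host-side copy.
    float *pinned;
    cudaMallocHost((void **)&pinned, size);
    cudaMemcpy(d_buf, pinned, size, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}
```

Timing the two cudaMemcpy calls (e.g. with CUDA events) should show the pinned transfer achieving noticeably higher bandwidth.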

Thanks for the reply!

Ah ok! But that brings me to two other questions:

  1. Why do I have to use cudaMallocHost or cudaHostAlloc for asynchronous data transfers, if cudaMemcpy with cudaMalloc uses page-locked memory (via the internal page-locked buffer) too?

  2. Is the page-locked buffer an area in the working space?

regards!

To answer question 1, it’s probably a matter of complexity and expectation. People who want to do asynchronous transfers are obviously after high performance, and hence are probably already using page-locked RAM. With that existing effort, why complicate the driver with managing multiple internal buffers? For question 2… what do you mean by ‘in the working space?’ Page-locked memory is accessed just like normal memory. It’s just that the OS knows not to swap those pages out.
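On question 1, it may help to see that cudaMemcpyAsync only behaves truly asynchronously when the host buffer is page-locked; with a pageable buffer the copy silently degrades to a synchronous one, because the driver cannot DMA from memory that might be swapped out. A sketch (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 16 << 20;   // arbitrary example size
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaMallocHost((void **)&h_buf, size);   // pinned: async copy can overlap
    cudaMalloc((void **)&d_buf, size);
    cudaStreamCreate(&stream);

    // Returns to the host immediately; the DMA engine performs the
    // transfer while the CPU (or another stream) keeps working.
    cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);

    // ... other CPU work can happen here, overlapped with the transfer ...

    cudaStreamSynchronize(stream);   // wait for the transfer to finish

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Had h_buf come from plain malloc, the cudaMemcpyAsync call would still work, but the overlap would be lost.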