cudaHostAllocMapped

Hey!

Can anybody tell me what the difference is between

cudaHostAlloc (void **ptr, size_t size, cudaHostAllocMapped)

and

cudaHostAlloc (void **ptr, size_t size, cudaHostAllocDefault)?

I don’t understand the advantage of mapped page-locked memory compared to normal page-locked memory. In both cases the device can access the host memory directly. What advantage does the address mapping bring? And which method is faster?

best regards!


Mapped memory can be accessed directly from within a kernel without needing to cudaMemcpy the region back and forth. You need to get a different pointer to do this on the device, though, through a CUDA API call that I don’t remember the name of at the moment.
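The call being referred to is cudaHostGetDevicePointer. A minimal sketch of mapped memory in use (error checking omitted; on older devices you may also need cudaSetDeviceFlags(cudaDeviceMapHost) before the first allocation):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;   // reads and writes go over the bus to host memory
}

int main(void)
{
    const int n = 256;
    int *h_ptr;   // host-side pointer
    int *d_ptr;   // device-side alias of the same allocation

    // Allocate mapped, page-locked host memory.
    cudaHostAlloc((void **)&h_ptr, n * sizeof(int), cudaHostAllocMapped);

    for (int i = 0; i < n; ++i)
        h_ptr[i] = i;

    // Obtain the device pointer for the same memory.
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    // The kernel works on host memory directly -- no cudaMemcpy needed.
    increment<<<(n + 127) / 128, 128>>>(d_ptr, n);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %d\n", h_ptr[0]);

    cudaFreeHost(h_ptr);
    return 0;
}
```

Note that every kernel access to mapped memory travels across PCIe, so it is usually only a win when the data is touched once or twice.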

Thanks for the reply!

Ok, so cudaHostAlloc (…, cudaHostAllocDefault) works in the same way as cudaMalloc(…), with the difference that cudaHostAlloc (…, cudaHostAllocDefault) uses no paging. Right? But why is using page-locked memory faster?

Page-locked memory is faster because the GPU can only DMA to page-locked memory (if it’s pageable, then the pages might get swapped out mid-transfer causing Bad Things to happen). This means that cudaMemcpy internally does a copy to a page-locked buffer, and then has the card DMA to that. This extra host-side copy slows things down. If you use cudaMallocHost or cudaHostAlloc, then the driver knows that the buffer is pinned and can skip the copy to the internal buffer.
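The two transfer paths described above can be sketched like this (a hypothetical comparison with an arbitrary buffer size; error checking and timing omitted):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t size = 64 << 20;   // 64 MiB, an arbitrary example size
    float *d_buf;
    cudaMalloc((void **)&d_buf, size);

    // Pageable host memory: cudaMemcpy must first copy the data into the
    // driver's internal page-locked staging buffer, then DMA from there.
    float *pageable = (float *)malloc(size);
    cudaMemcpy(d_buf, pageable, size, cudaMemcpyHostToDevice);

    // Page-locked host memory: the GPU DMAs directly from this buffer,
    // skipping the extra host-side copy.
    float *pinned;
    cudaMallocHost((void **)&pinned, size);
    cudaMemcpy(d_buf, pinned, size, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}
```

Timing the two cudaMemcpy calls (e.g. with CUDA events) should show the pinned transfer achieving noticeably higher bandwidth.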

Thanks for the reply!

Ah ok! But that brings me to two other questions:

  1. Why do I have to use cudaMallocHost or cudaHostAlloc for asynchronous data transfers, if cudaMemcpy with cudaMalloc uses page-locked memory (via the internal page-locked buffer) too?

  2. Is the page-locked buffer an area in the working space?

regards!

To answer question 1, it’s probably a matter of complexity and expectation. People who want to do asynchronous transfers are obviously after high performance, and hence are probably already using page-locked RAM. With that existing effort, why complicate the driver with managing multiple internal buffers? For question 2… what do you mean by ‘in the working space?’ Page-locked memory is accessed just like normal memory. It’s just that the OS knows not to swap those pages out.
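On question 1, it may help to see that cudaMemcpyAsync only behaves truly asynchronously when the host buffer is page-locked; with a pageable buffer the copy silently degrades to a synchronous one, because the driver cannot DMA from memory that might be swapped out. A sketch (error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 16 << 20;   // arbitrary example size
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaMallocHost((void **)&h_buf, size);   // pinned: async copy can overlap
    cudaMalloc((void **)&d_buf, size);
    cudaStreamCreate(&stream);

    // Returns to the host immediately; the DMA engine performs the
    // transfer while the CPU (or another stream) keeps working.
    cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);

    // ... other CPU work can happen here, overlapped with the transfer ...

    cudaStreamSynchronize(stream);   // wait for the transfer to finish

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Had h_buf come from plain malloc, the cudaMemcpyAsync call would still work, but the overlap would be lost.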