cudaHostRegister/Unregister vs Host Memcpy to Pagelocked

Tiomat · November 22, 2012, 4:29pm

Is there any noticeable overhead in calling these functions to page-lock memory allocated with plain malloc? I have an undefined (but large) number of images, too many to page-lock all of them safely. I then need to access them in an essentially random order, so I was wondering whether calling cudaHostRegister/cudaHostUnregister around an async copy would be better (or at least no worse) than doing a host-side memcpy from pageable to page-locked memory before the cudaMemcpyAsync.

Cheers,
Tiomat

cmaster.matso · November 22, 2012, 8:52pm

As far as I know, copies from page-locked memory can be performed faster than copies from normal host memory. According to the “CUDA by Example” book, using pinned memory can make copying up to twice as fast. But it is only good for buffers that are often released, such as kernel input or output.

From NVIDIA CUDA Reference Manual:
“[…]Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.[…]”

But in your case, why not use textures instead?

MK

Tiomat · November 23, 2012, 9:24am

In a sense the buffers are often released, as you put it, since this is an iterative solution that takes a single one of these images in each iteration. The problem is the sheer number of images that need to be pulled in, which can require far more memory than is available on the actual GPU (and that would add an artificial limit, which is not desirable).

As such, the plan is to do an asynchronous memory copy of the next frame’s image onto the GPU during the current frame’s calculation. This requires pinned memory, so the question of how best to provide it has arisen. There is always the option of allocating some pinned memory on the host and copying the next frame’s image into it before the transfer, but I was curious whether I could sidestep this by calling cudaHostRegister/cudaHostUnregister on the existing malloced memory to avoid the memcpy.
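For reference, the staging option described above could be sketched roughly as follows. This is only an illustration under assumptions: the image size, image count, sequential access order and the processImage kernel are placeholders, not part of the actual application. Two pinned staging buffers and two streams are used so that the host memcpy and upload of the next image can overlap with the kernel working on the current one.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                   cudaGetErrorString(err_));                    \
      std::exit(EXIT_FAILURE);                                   \
    }                                                            \
  } while (0)

// Hypothetical stand-in for the per-iteration computation: inverts each byte.
__global__ void processImage(unsigned char* img, size_t n) {
  size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < n) img[i] = 255 - img[i];
}

int main() {
  const size_t imageBytes = 4u << 20;  // assumed image size (4 MiB)
  const int numImages = 8;             // assumed image count

  // Pageable images allocated with plain malloc, as in the question.
  std::vector<unsigned char*> images(numImages);
  for (int i = 0; i < numImages; ++i) {
    images[i] = static_cast<unsigned char*>(std::malloc(imageBytes));
    if (!images[i]) return EXIT_FAILURE;
    std::memset(images[i], i, imageBytes);
  }

  // Two slots: each has a pinned staging buffer, a device buffer and a stream.
  unsigned char* staging[2];
  unsigned char* dImg[2];
  cudaStream_t stream[2];
  for (int b = 0; b < 2; ++b) {
    CHECK(cudaMallocHost((void**)&staging[b], imageBytes));
    CHECK(cudaMalloc((void**)&dImg[b], imageBytes));
    CHECK(cudaStreamCreate(&stream[b]));
  }

  const int threads = 256;
  const int blocks = static_cast<int>((imageBytes + threads - 1) / threads);

  for (int i = 0; i < numImages; ++i) {
    const int b = i % 2;
    // Wait until the earlier copy and kernel that used this slot are done,
    // so the staging and device buffers can be overwritten safely.
    CHECK(cudaStreamSynchronize(stream[b]));
    // Host-side copy from pageable to pinned memory.
    std::memcpy(staging[b], images[i], imageBytes);
    // Asynchronous upload and kernel in this slot's stream; the kernel of the
    // previous image may still be running in the other stream.
    CHECK(cudaMemcpyAsync(dImg[b], staging[b], imageBytes,
                          cudaMemcpyHostToDevice, stream[b]));
    processImage<<<blocks, threads, 0, stream[b]>>>(dImg[b], imageBytes);
    CHECK(cudaGetLastError());
  }
  CHECK(cudaDeviceSynchronize());

  for (int b = 0; b < 2; ++b) {
    CHECK(cudaStreamDestroy(stream[b]));
    CHECK(cudaFree(dImg[b]));
    CHECK(cudaFreeHost(staging[b]));
  }
  for (unsigned char* p : images) std::free(p);
  std::printf("processed %d images\n", numImages);
  return 0;
}
```

With this layout only two image-sized buffers are ever pinned, regardless of how many images exist on the host.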

To know whether this is viable, I was interested in knowing whether the HostRegister/Unregister have overhead associated with them (other than potentially bringing the page back into physical memory) more than a few microseconds. If, for example, under the hood the driver simply does its own memcpy into some pinned memory in driver space then obviously that then does not provide any real benefit.
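One way to get a rough answer on a given machine is simply to time both options. The sketch below is a minimal micro-benchmark, not a statement about what the driver does internally: the buffer size and repetition count are arbitrary assumptions, and the result will depend on the OS, driver and whether the pages are already resident.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                   cudaGetErrorString(err_));                    \
      std::exit(EXIT_FAILURE);                                   \
    }                                                            \
  } while (0)

// Milliseconds elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0) {
  return std::chrono::duration<double, std::milli>(
             std::chrono::steady_clock::now() - t0).count();
}

int main() {
  const size_t bytes = 4u << 20;  // assumed image size (4 MiB)
  const int reps = 100;           // assumed repetition count

  CHECK(cudaFree(0));  // force context creation so it is not timed

  // Pageable buffer from plain malloc. Some older toolkits may require
  // page-aligned pointers and sizes for cudaHostRegister.
  unsigned char* pageable = static_cast<unsigned char*>(std::malloc(bytes));
  if (!pageable) return EXIT_FAILURE;
  std::memset(pageable, 1, bytes);  // touch the pages so they are resident

  unsigned char* pinned = nullptr;
  CHECK(cudaMallocHost((void**)&pinned, bytes));

  // Option A: pin and unpin the existing allocation each time.
  auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    CHECK(cudaHostRegister(pageable, bytes, cudaHostRegisterDefault));
    CHECK(cudaHostUnregister(pageable));
  }
  const double registerMs = msSince(t0) / reps;

  // Option B: copy into a pinned staging buffer allocated once.
  t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    std::memcpy(pinned, pageable, bytes);
  }
  const double memcpyMs = msSince(t0) / reps;

  std::printf("register+unregister: %.3f ms per call pair\n", registerMs);
  std::printf("memcpy to pinned:    %.3f ms per copy (check byte %d)\n",
              memcpyMs, static_cast<int>(pinned[bytes - 1]));

  CHECK(cudaFreeHost(pinned));
  std::free(pageable);
  return 0;
}
```

Comparing the two averages for the real image size should show whether registering on the fly is competitive with the staging copy on that system.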

HannesF99 · November 26, 2012, 8:15am

I would also be interested to know whether there is some overhead in using the cudaHostRegister function, and what it does ‘under the hood’. It would be nice if someone from NVIDIA could give some information on that. Thanks in advance.
