cudaHostRegister/Unregister vs Host Memcpy to Pagelocked

Is there any noticeable overhead to calling these functions to page-lock normally malloc'ed memory? I have an unspecified (but large) number of images, so many that page-locking all of them would be dangerous. I then need to access them in essentially random order, so I was wondering whether calling cudaHostRegister/cudaHostUnregister around an async copy would be better (or at least no worse) than doing a memcpy on the host from pageable to page-locked memory before performing the cudaMemcpyAsync.


As far as I know, copies from page-locked memory can be faster than copies from normal host memory. According to the “CUDA by Example” book, using pinned memory can make copying as much as twice as fast. But it is only good for buffers that are often released, such as kernel input or output.

From NVIDIA CUDA Reference Manual:
“[…]Page-locking excessive amounts of memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to register staging areas for data exchange between host and device.[…]”

But in your case, why not use textures instead?


In a sense the buffers are often released, as you put it, since this is an iterative solution that takes a single one of these images in each iteration. The problem is the sheer number of images that need to be pulled in, which can require far more memory than is available on the actual GPU (and holding them all there adds an artificial limit, which is not desirable).

As such, the plan is to do an asynchronous memory copy of the next frame’s image onto the GPU during the current frame’s calculation. This requires pinned memory, so the question of how best to provide it has arisen. There is always the option of allocating some pinned memory on the host and copying the next frame’s image into it before the transfer, but I was curious whether I could sidestep this by using cudaHostRegister/cudaHostUnregister on the existing malloc'ed memory to avoid the memcpy.
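For reference, the staging option described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the kernel processImage, the function runPipeline, the image size and count, and the two-buffer layout are all hypothetical stand-ins, not anything from an existing code base. It uses two pinned staging buffers and two device buffers so that the upload of image i+1 can overlap the kernel working on image i.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); exit(1); } } while (0)

// Hypothetical stand-in for the per-frame computation: sums the image bytes.
__global__ void processImage(const unsigned char* img, size_t n, unsigned int* sum) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, (unsigned int)img[i]);
}

// Returns true if every image produced the expected checksum.
bool runPipeline(int numImages, size_t imageBytes) {
    // Pageable host images, standing in for the malloc'ed image set.
    std::vector<unsigned char*> images(numImages);
    for (int i = 0; i < numImages; ++i) {
        images[i] = (unsigned char*)malloc(imageBytes);
        memset(images[i], i + 1, imageBytes);
    }
    unsigned char *staging[2], *dImg[2];
    cudaEvent_t copied[2], consumed[2];
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMallocHost((void**)&staging[b], imageBytes));  // pinned staging buffer
        CHECK(cudaMalloc((void**)&dImg[b], imageBytes));
        CHECK(cudaEventCreate(&copied[b]));
        CHECK(cudaEventCreate(&consumed[b]));
    }
    unsigned int* dSums;
    CHECK(cudaMalloc((void**)&dSums, numImages * sizeof(unsigned int)));
    CHECK(cudaMemset(dSums, 0, numImages * sizeof(unsigned int)));
    cudaStream_t copyStream, computeStream;
    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaStreamCreate(&computeStream));

    // Stage and upload the first image before the loop starts.
    memcpy(staging[0], images[0], imageBytes);
    CHECK(cudaMemcpyAsync(dImg[0], staging[0], imageBytes, cudaMemcpyHostToDevice, copyStream));
    CHECK(cudaEventRecord(copied[0], copyStream));

    const int threads = 256;
    const int blocks = (int)((imageBytes + threads - 1) / threads);
    for (int i = 0; i < numImages; ++i) {
        int cur = i % 2, nxt = (i + 1) % 2;
        // Compute on image i once its upload has finished.
        CHECK(cudaStreamWaitEvent(computeStream, copied[cur], 0));
        processImage<<<blocks, threads, 0, computeStream>>>(dImg[cur], imageBytes, dSums + i);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(consumed[cur], computeStream));
        if (i + 1 < numImages) {
            // Reuse the staging buffer only after its previous upload is done.
            CHECK(cudaEventSynchronize(copied[nxt]));
            memcpy(staging[nxt], images[i + 1], imageBytes);
            // Reuse the device buffer only after the kernel that read it is done.
            CHECK(cudaStreamWaitEvent(copyStream, consumed[nxt], 0));
            CHECK(cudaMemcpyAsync(dImg[nxt], staging[nxt], imageBytes,
                                  cudaMemcpyHostToDevice, copyStream));
            CHECK(cudaEventRecord(copied[nxt], copyStream));
        }
    }
    CHECK(cudaDeviceSynchronize());

    std::vector<unsigned int> sums(numImages);
    CHECK(cudaMemcpy(sums.data(), dSums, numImages * sizeof(unsigned int),
                     cudaMemcpyDeviceToHost));
    bool ok = true;
    for (int i = 0; i < numImages; ++i)
        if (sums[i] != (unsigned int)((i + 1) * imageBytes)) ok = false;

    for (int b = 0; b < 2; ++b) {
        CHECK(cudaFreeHost(staging[b]));
        CHECK(cudaFree(dImg[b]));
        CHECK(cudaEventDestroy(copied[b]));
        CHECK(cudaEventDestroy(consumed[b]));
    }
    CHECK(cudaFree(dSums));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaStreamDestroy(computeStream));
    for (unsigned char* p : images) free(p);
    return ok;
}

int main() {
    // Assumed sizes: 8 images of 1 MiB each.
    printf("%s\n", runPipeline(8, 1 << 20) ? "ok" : "mismatch");
    return 0;
}
```

The only pinned memory here is the two staging buffers, regardless of how many images exist on the host, and the host-side memcpy is the cost that the register/unregister idea would be trying to avoid.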

To know whether this is viable, I would like to know whether cudaHostRegister/cudaHostUnregister carry more than a few microseconds of overhead (other than potentially bringing the pages back into physical memory). If, for example, the driver under the hood simply does its own memcpy into some pinned memory in driver space, then it obviously provides no real benefit.
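One way to get a first answer would be to time both paths directly. The following is a rough sketch, not a rigorous benchmark: the function compareUploadPaths, the buffer size and the repetition count are assumptions made for illustration, and the numbers will depend on the driver, OS and hardware. Both loops include the host-to-device copy itself, so the difference between the two averages approximates register/unregister cost versus host memcpy cost.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); exit(1); } } while (0)

using Clock = std::chrono::steady_clock;
static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Hypothetical helper: reports the average milliseconds per upload for each path.
void compareUploadPaths(size_t bytes, int reps, double* registerMs, double* stagingMs) {
    unsigned char* image = (unsigned char*)malloc(bytes);  // pageable source image
    memset(image, 1, bytes);                               // touch the pages once
    unsigned char *staging, *dImg;
    CHECK(cudaMallocHost((void**)&staging, bytes));        // pinned staging buffer
    CHECK(cudaMalloc((void**)&dImg, bytes));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Path A: pin the existing allocation, copy from it, then unpin.
    auto t0 = Clock::now();
    for (int r = 0; r < reps; ++r) {
        CHECK(cudaHostRegister(image, bytes, cudaHostRegisterDefault));
        CHECK(cudaMemcpyAsync(dImg, image, bytes, cudaMemcpyHostToDevice, stream));
        CHECK(cudaStreamSynchronize(stream));  // copy must finish before unpinning
        CHECK(cudaHostUnregister(image));
    }
    *registerMs = msSince(t0) / reps;

    // Path B: host memcpy into the pinned staging buffer, then copy from there.
    t0 = Clock::now();
    for (int r = 0; r < reps; ++r) {
        memcpy(staging, image, bytes);
        CHECK(cudaMemcpyAsync(dImg, staging, bytes, cudaMemcpyHostToDevice, stream));
        CHECK(cudaStreamSynchronize(stream));
    }
    *stagingMs = msSince(t0) / reps;

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dImg));
    CHECK(cudaFreeHost(staging));
    free(image);
}

int main() {
    double registerMs = 0.0, stagingMs = 0.0;
    compareUploadPaths(16 << 20, 20, &registerMs, &stagingMs);  // assumed: 16 MiB, 20 reps
    printf("register/copy/unregister: %.3f ms per image\n", registerMs);
    printf("memcpy to pinned + copy:  %.3f ms per image\n", stagingMs);
    return 0;
}
```

Running this with the real image size should show whether the register/unregister pair costs microseconds or something closer to the memcpy it is meant to replace.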

I would also be interested to know whether there is some overhead in using cudaHostRegister, and what it does ‘under the hood’. It would be nice if someone from NVIDIA could give some information on that. Thanks in advance.