A short description of my problem is as follows:
I developed a function that calls a CUDA kernel. My function receives a pointer to the host data buffers (input and output of kernel), and has no control over the allocation of these buffers.
–> It is possible that the host data was allocated with either of malloc or cudaHostAlloc. My function is not specifically told which allocation method was used.
The question is: what is a feasible way for my function to figure out whether the host buffers are pinned/page-locked (cudaHostAlloc) or not (regular malloc)?
The reason I am asking is that if they are not page-locked, I would like to use cudaHostRegister() to make them (the buffers) so, which renders them amenable for streams.
I have tried three ways which have failed:
1- Always apply cudaHostRegister(): this way breaks if the host buffers are already pinned
2- Run cudaPointerGetAttributes(), and if the return error is cudaSuccess, then the buffers are already pinned, nothing to do; else if cudaErrorInvalidValue, apply cudaHostRegister : for some reason this way results in the kernel execution returning an error
3- Run cudaHostGetFlags(), and if return is not a success, then apply cudaHostRegister : same behavior as 2-.
In the case of 2- and 3-, the error is “invalid argumentn”
Note that my code currently is not using streams, rather always calls cudaMemcpy() for the entire host buffers. If I do not use any of the three above ways, my code runs to completion, regardless of whether the host buffer is pinned or not.
Any advice? Many thanks in advance.
I am not quite sure but I seem to recall that cudaMemcpyAsync() automatically falls back to cudaMemcpy() if the host buffer is not pinned. If confirmed, wouldn’t this provide sufficient robustness for code that uses streams in conjunction with cudaMemcpyAsync()?
In principle, as far as I can tell, yes – cudaMemcpuAsync() would provide enough robustness. Will give it a shot; thanks!
One caveat is that without streams, I am getting a performance boost out of cudaHostRegister() (~1.5X speedup). In essence, it seems like using cudaHostRegister() does provide some benefits.
Note that reverting to method 3- above, the program runs to completion, and I get the expected results back. To do so, I am simply not checking the return value of cudaGetLastError() after the kernel call. Any advice on this matter too would help.
transfers with pinned memory are often faster than with non-pinned memory. counteracting this, the registration time usually outweighs any time benefit from a single transfer. But if the buffer is used repeatedly, registering it will give a net improvement in overall execution time for these calls (alloc/registration + cudaMemcpy).
this question was cross-posted here:
and an explanation is given there for the error behavior.
Looking at just the time for the transfers themselves, shouldn’t transfers from pinned memory always be faster?
With pinned memory, the GPU’s DMA engine can pull the data straight from the host’s system memory based on the known physical address, for as large a block size as has been pinned. Without pinning the host buffer, the contents of the host memory is first copied (at system memory transfer rates) to a pinned buffer of limited size allocated by the driver, and DMAed from there. In addition to the additional copy operation in system memory, large transfers may be broken into multiple smaller ones to account for the limited buffer size.