In our workflow, we pin certain images by calling ‘cudaHostRegister’ (see CUDA Runtime API :: CUDA Toolkit Documentation).
Due to the multi-threaded and parallel nature of our software framework, the same image will likely be pinned several times via ‘cudaHostRegister’.
Is it a problem if cudaHostRegister is called multiple times with the same parameters (ptr, size, flags)? I would expect the second call to be a no-op internally, since the memory buffer has already been pinned (page-locked).
For the un-registering via ‘cudaHostUnregister’, we can guarantee that it is done only once (on destruction of the image).
Note: regarding practical limits for pinned memory, see https://devtalk.nvidia.com/default/topic/769383/arbitrary-device-limit-on-pinned-host-memory/ and https://devtalk.nvidia.com/default/topic/883675/cuda-programming-and-performance/pinned-memory-limit/
I think so. Did you try it?
This is the behavior I observe on CUDA 10.1:
$ cat t458.cu
#include <stdio.h>
#include <stdlib.h>
int main(){
  int *data = (int *)malloc(1000);
  // first registration of the buffer
  cudaError_t err = cudaHostRegister(data, 1000, cudaHostRegisterDefault);
  if (err != cudaSuccess) {printf("err1 = %s\n", cudaGetErrorString(err));}
  // second registration of the same range with identical parameters
  err = cudaHostRegister(data, 1000, cudaHostRegisterDefault);
  if (err != cudaSuccess) {printf("err2 = %s\n", cudaGetErrorString(err));}
  // returns the pending error and resets it
  err = cudaGetLastError();
  if (err != cudaSuccess) {printf("err3 = %s\n", cudaGetErrorString(err));}
  // prints nothing if the error was non-sticky
  err = cudaGetLastError();
  if (err != cudaSuccess) {printf("err4 = %s\n", cudaGetErrorString(err));}
  return 0;
}
$ nvcc -o t458 t458.cu
$ ./t458
err2 = part or all of the requested memory range is already mapped
err3 = part or all of the requested memory range is already mapped
$
Based on that I would conclude that “it is a problem” because it did not return cudaSuccess.
The problem appears to be “non-sticky” which means that the underlying CUDA context is not corrupted. A similar non-sticky error occurs for example if you try to allocate too much memory. The underlying CUDA context is still usable.
The non-corrupted CUDA context state is a strong indication that the extra operation is a no-op, albeit returning an error.
No, I have not implemented it so far. I have only laid out the idea for this optimization (pinning certain ‘critical’ image buffers on the CPU which will be transferred often to/from GPU memory).
I added some text to my previous entry. It is obviously not correct behavior from a usage standpoint, and the indication of this is that a CUDA error code is returned if you try to do it. From that perspective I would say “it is a problem”. However, the CUDA runtime API is also telling me that this is a non-sticky error, which means that the underlying CUDA context is not corrupted. That in turn suggests that the previous pinning of that memory was not perturbed by the second call; it was basically a no-op.
I doubt you’ll get any formal statement of that in the documentation unless you file a bug report asking for a clarification. However if you are willing to tolerate (or better, trap and handle) a possible error code, the second operation does appear to be a no-op and “safe” from that viewpoint. If it were fundamentally damaging or leading to some unexpected state in the underlying CUDA context, then it should be a sticky error. The fact that it is not a sticky error means that the CUDA context is still usable and therefore the state of it is logically inferable by the programmer at runtime.
Note that my test case is not multi-threaded. I wouldn’t expect any difference with a multi-threaded test case but I have not tried that. Please refer to my previous statement about formal statements. This is not a formal statement. If you require a formal statement in the documentation, I suggest filing a bug.
Thanks for the thorough (and fast) information!
It seems that I have to implement it so that ‘cudaHostRegister’ is called only once for a given buffer.
That makes the implementation more complicated, but it is the safe way.
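One possible way to get that “only once per buffer” guarantee across threads is a small registry guarded by a mutex. The following is only a sketch under assumptions: `PinRegistry`, `pinOnce` and `unpin` are hypothetical names, and the actual pinning is injected as two callbacks so the bookkeeping can be tried without a GPU. In a real build the callbacks would presumably wrap ‘cudaHostRegister’ and ‘cudaHostUnregister’, as indicated in the comment.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_set>
#include <utility>

// Hypothetical helper: remembers which host pointers are already pinned so
// that the pin callback runs at most once per buffer, even when several
// threads request pinning of the same image.
//
// Assumed wiring in a CUDA build (not compiled here):
//   pin   = [](void* p, std::size_t n) {
//             return cudaHostRegister(p, n, cudaHostRegisterDefault) == cudaSuccess; };
//   unpin = [](void* p) { cudaHostUnregister(p); };
class PinRegistry {
public:
    using PinFn = std::function<bool(void*, std::size_t)>;
    using UnpinFn = std::function<void(void*)>;

    PinRegistry(PinFn pin, UnpinFn unpin)
        : pin_(std::move(pin)), unpin_(std::move(unpin)) {}

    // Returns true if the buffer is pinned when the call returns.
    // The pin callback is invoked only if the pointer is not yet known.
    bool pinOnce(void* ptr, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pinned_.count(ptr) != 0) {
            return true;  // already pinned by an earlier call
        }
        if (!pin_(ptr, size)) {
            return false;  // pinning failed, nothing recorded
        }
        pinned_.insert(ptr);
        return true;
    }

    // Intended to be called once, on destruction of the image.
    // Returns true if this call actually unpinned the buffer.
    bool unpin(void* ptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pinned_.erase(ptr) == 0) {
            return false;  // pointer was never pinned through this registry
        }
        unpin_(ptr);
        return true;
    }

private:
    PinFn pin_;
    UnpinFn unpin_;
    std::mutex mutex_;
    std::unordered_set<void*> pinned_;
};
```

The lock is held while the callback runs, so concurrent pin requests are serialized. That keeps the logic simple, at the cost of some contention if many different buffers are pinned at the same time.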
Is it safe to call ‘cudaHostUnregister(p)’ for a pointer ‘p’ whose corresponding CPU buffer has already been deallocated (freed)?