I ran some performance tests in which I allocated host memory (for my input data) with “cudaMallocHost()” (PINNED mode) and “malloc()” (PAGED mode). I tested different memory sizes (1, 50, 100, 1000 and 10000 bytes), and it seems that the first time a program allocates memory with “cudaMallocHost()”, the call takes around 90 milliseconds.
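For reference, a minimal sketch of that kind of comparison could look like the following (the sizes and the std::chrono-based timing are illustrative, not the exact benchmark used above):

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time a single pinned (page-locked) host allocation of `bytes` bytes.
static double time_pinned_alloc(size_t bytes)
{
    void* p = nullptr;
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMallocHost(&p, bytes);          // PINNED mode
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaFreeHost(p);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Time a single ordinary pageable host allocation of `bytes` bytes.
static double time_paged_alloc(size_t bytes)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    void* p = malloc(bytes);            // PAGED mode
    auto t1 = std::chrono::high_resolution_clock::now();
    free(p);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t sizes[] = {1, 50, 100, 1000, 10000};
    for (size_t s : sizes)
        printf("%6zu bytes: pinned %.3f ms, paged %.3f ms\n",
               s, time_pinned_alloc(s), time_paged_alloc(s));
    return 0;
}
```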
This could be due to the startup overhead of the first use of the graphics card. If you first call a ‘dummy’ kernel and only then start your timing, the results will probably be better.
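For example, a warm-up along these lines (the kernel name is just illustrative; calling any CUDA function first has the same effect) pushes the one-time startup cost outside the timed region:

```cpp
#include <cuda_runtime.h>

// Empty kernel used only to force context creation before timing starts.
__global__ void warmup_kernel() {}

void warm_up_device()
{
    warmup_kernel<<<1, 1>>>();   // first CUDA call triggers initialization
    cudaDeviceSynchronize();     // wait until that initialization is done
}
```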
This is expected: pinning memory means the operating system may no longer relocate or page out those pages, so it has to do quite a bit of reorganization to guarantee that. This almost certainly requires locking to avoid race conditions, which makes it expensive. Subsequent allocations are faster because the OS probably took the opportunity to reserve more memory the first time the process requested some.
Also, the documentation states that CUDA is initialized the first time a CUDA function is called. So if “cudaMallocHost()” is the first CUDA function you call, you are probably including the CUDA initialization time in its first-call timing.