cudaMallocHost() vs. malloc(): the first “cudaMallocHost()” call takes ~90 ms


I did some performance tests in which I allocated host memory (for my input data) with “cudaMallocHost()” (PINNED mode) and “malloc()” (PAGED mode). I tested different memory sizes (1, 50, 100, 1000 and 10000 bytes), and it seems that the first time you allocate memory in your program using “cudaMallocHost()”, the call takes around 90 milliseconds.

Here are the exact times (in milliseconds):

Memsize (bytes)  PINNED1    PINNED2   PINNED3   PAGED     cudaMalloc (device)
1                98.243408  0.117040  0.116175  0.001571  0.002575
50               84.392342  0.115694  0.114750  0.001632  0.002571
100              85.650665  0.116036  0.114686  0.001763  0.002500
1000             95.233505  0.116254  0.115002  0.004049  0.002560
10000            86.180138  0.122314  0.119716  0.007654  0.091254

PINNED1 is the first time I allocate pinned memory, PINNED2 the second, and so on. Can someone reproduce this, or can someone from NVIDIA confirm it?
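
For anyone who wants to try reproducing this, here is a minimal sketch of how such a measurement might look. It is not the code used for the table above: the helpers “time_ms” and “check” are hypothetical names, the timer is a plain host-side std::chrono clock, and the size is read from the command line so that each run starts in a fresh process.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock duration of one call, in milliseconds.
template <typename F>
static double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main(int argc, char** argv) {
    // Allocation size in bytes; 1000 is an arbitrary default.
    const size_t bytes = (argc > 1) ? std::strtoull(argv[1], nullptr, 10) : 1000;

    // Three consecutive pinned allocations (PINNED1..PINNED3).
    void* pinned[3] = {nullptr, nullptr, nullptr};
    for (int i = 0; i < 3; ++i) {
        cudaError_t err = cudaSuccess;
        const double ms = time_ms([&] { err = cudaMallocHost(&pinned[i], bytes); });
        check(err, "cudaMallocHost");
        std::printf("PINNED%d: %f ms\n", i + 1, ms);
    }

    // Ordinary pageable host allocation (PAGED).
    void* paged = nullptr;
    double ms = time_ms([&] { paged = std::malloc(bytes); });
    std::printf("PAGED: %f ms\n", ms);

    // Device allocation, timed after the pinned ones.
    void* dev = nullptr;
    cudaError_t err = cudaSuccess;
    ms = time_ms([&] { err = cudaMalloc(&dev, bytes); });
    check(err, "cudaMalloc");
    std::printf("cudaMalloc (device): %f ms\n", ms);

    cudaFree(dev);
    std::free(paged);
    for (void* p : pinned) cudaFreeHost(p);
    return 0;
}
```

Running it once per size (for example with 1, 50, 100, 1000 and 10000 as the argument) keeps the first pinned allocation the first CUDA call of the process each time.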

thanks in advance and best regards,


This could be due to some startup overhead the first time the graphics card is used. If you first launch a ‘dummy’ kernel and only then start your timing, the results will probably be better.

Sure, but it seems that this is “cudaMallocHost()” specific. The plain old “malloc()” has no overhead, right?

This is expected: pinning memory means the operating system may no longer page it out or relocate it, so it has to do quite a bit of reorganization to guarantee that. It most likely also has to take a lock to avoid race conditions, which makes the call expensive. The subsequent allocations are faster, probably because the OS took the opportunity to reserve more memory the first time the process requested some.


Also, it’s specified that CUDA is initialized the first time you call a CUDA function. So if “cudaMallocHost()” is the first CUDA function your program calls, your first measurement probably includes the CUDA initialization time.
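
One way to check this is to force the initialization with an earlier CUDA call and time it separately. Below is a minimal sketch, assuming that “cudaFree(0)” (a commonly used idiom, not something taken from this thread) is enough to trigger initialization on your setup; “now_ms” is a hypothetical helper and the 1000-byte size is arbitrary.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: monotonic host clock reading in milliseconds.
static double now_ms() {
    const auto t = std::chrono::steady_clock::now().time_since_epoch();
    return std::chrono::duration<double, std::milli>(t).count();
}

int main() {
    // Step 1: first CUDA call of the process. Freeing a null pointer does
    // no real work, so its cost should be mostly runtime initialization
    // (assumption: this call triggers initialization on your setup).
    const double t0 = now_ms();
    cudaError_t err = cudaFree(0);
    const double t1 = now_ms();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Step 2: first pinned allocation, now timed after initialization.
    void* pinned = nullptr;
    const double t2 = now_ms();
    err = cudaMallocHost(&pinned, 1000);
    const double t3 = now_ms();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    std::printf("first CUDA call (cudaFree(0)):   %f ms\n", t1 - t0);
    std::printf("first cudaMallocHost afterwards: %f ms\n", t3 - t2);

    cudaFreeHost(pinned);
    return 0;
}
```

If most of the ~90 ms moves to the first line of output, the overhead is initialization rather than pinning itself; if the first “cudaMallocHost()” is still slow, part of the cost belongs to the pinned allocation.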

thanks to all of you for your answers!

have a nice day,