cudaHostAlloc - very slow the first time

I have a speed problem with cudaHostAlloc.
Basically, my CUDA routine (let's call it John) is:

1 cudaSetDeviceFlags(cudaDeviceMapHost);
2 cudaHostAlloc((void**)&A, size, cudaHostAllocMapped);
3 cudaHostAlloc((void**)&B, size, cudaHostAllocMapped);
4 …calculations…kernels…
5 cudaFreeHost(A);
6 cudaFreeHost(B);

Execution time of line 2: 2 seconds
Execution time of line 3: 0.0001 seconds
Execution time of lines 1-2-3-4-5-6: 10 seconds

Why is the first allocation so slow?

I tried calling John twice from main: the second call is fast, with an execution time of 0.0001 seconds for both lines 2 and 3.
What happens during the first call to cudaHostAlloc?


The first call to a CUDA function such as cudaMalloc (and apparently cudaHostAlloc too) triggers the creation of the CUDA context, and potentially the wake-up of the card as well.
You can reduce this time by enabling persistence mode on the card (nvidia-smi -pm 1), and you can keep it out of your timings by triggering the creation of the context earlier, for example with a call to “cudaMalloc(&prt, 0)” (where prt is a pointer to anything).
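
To check this on your own machine, here is a minimal sketch that times the context creation separately from the two allocations. It makes a few assumptions that are not in your post: the 64 MiB buffer size is arbitrary, the CHECK macro and the now() helper are made up for the illustration, and I use cudaFree(0) as the warm-up call, a commonly used idiom that plays the same role as the zero-byte cudaMalloc above. The exact numbers, and how much of the cost lands in the warm-up call, will depend on your driver and toolkit version.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical helper: wall-clock time in seconds from a monotonic clock.
static double now() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(
               clock::now().time_since_epoch()).count();
}

int main() {
    const size_t size = size_t(64) << 20;  // 64 MiB, arbitrary (assumption)
    float *A = nullptr;
    float *B = nullptr;

    // Must be set before the context exists, as in the original routine.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    double t0 = now();
    CHECK(cudaFree(0));  // frees nothing; used only to force context creation
    double t1 = now();
    CHECK(cudaHostAlloc((void**)&A, size, cudaHostAllocMapped));
    double t2 = now();
    CHECK(cudaHostAlloc((void**)&B, size, cudaHostAllocMapped));
    double t3 = now();

    std::printf("warm-up (context creation): %.4f s\n", t1 - t0);
    std::printf("first cudaHostAlloc:        %.4f s\n", t2 - t1);
    std::printf("second cudaHostAlloc:       %.4f s\n", t3 - t2);

    CHECK(cudaFreeHost(A));
    CHECK(cudaFreeHost(B));
    return 0;
}
```

Compile it with something like nvcc -o warmup warmup.cu (the file name is just an example). If the explanation above is right, most of the start-up delay should show up on the warm-up line rather than on the first cudaHostAlloc.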

Thank you for your answer !