Memory allocation takes too much time!!

I use cudaMalloc to allocate memory, and I time the call with cutResetTimer, cutStartTimer and cutStopTimer.

In my code the measured time is on the order of 200 to 300 ms. However, when I time a cudaMalloc in another program (in particular the asyncAPI sample in the SDK), the measured time is around 0.36 ms. I cannot figure out what the difference between the two is, or why cudaMalloc is so slow in my code. What is the expected time for cudaMalloc: around 200 ms or around 0.5 ms?

Something that may be helpful: I compile my code as a shared library that is called by another function. Could that be the reason for the delay?

Thanks!!

I also tried my code as a static library, but I still get the same timing results.

It’s slow the first time because the first CUDA runtime call has to create a context.
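One way to check this is to time two identical allocations back to back in the same process. The sketch below is only an illustration, not the original poster's code: it uses std::chrono instead of the cutil timers, `timedMallocMs` is a made-up helper name, and the 16 MiB size is arbitrary. If the context explanation holds, the first call should come out much slower than the second.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the SDK): wall-clock time of a single
// cudaMalloc call in milliseconds. Returns -1.0 if the allocation fails.
static double timedMallocMs(size_t bytes) {
    void *ptr = nullptr;
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);  // release the buffer; the context stays alive
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const size_t bytes = 16 << 20;  // 16 MiB, arbitrary size for illustration

    // First runtime call in the process: the measurement includes
    // context creation on top of the allocation itself.
    double first = timedMallocMs(bytes);

    // Same allocation again: the context already exists.
    double second = timedMallocMs(bytes);

    std::printf("first  cudaMalloc: %.3f ms\n", first);
    std::printf("second cudaMalloc: %.3f ms\n", second);
    return 0;
}
```

If you only care about the allocation cost, a common idiom is to call cudaFree(0) once before starting the timer, so that the context is created outside the timed region.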

But why does the cudaMalloc in the asyncAPI take only a few milliseconds?