I use cudaMalloc to allocate memory, and I time the call with cutResetTimer, cutStartTimer and cutStopTimer.
In my code the measured time is on the order of 200 to 300 ms. However, when I time a cudaMalloc in another program (specifically the asyncAPI sample from the SDK), the measured time is around 0.36 ms. I cannot work out what differs between the two cases, or why cudaMalloc is so slow in my code. What is the expected time for a cudaMalloc call: around 200 ms, or around 0.5 ms?
One detail that may be relevant: I compile my code as a shared library, which is then called from another function. Could that be the cause of the delay?
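
To make the measurement easier to reproduce, here is a minimal sketch of how the same call could be timed twice in a row, so that a one-time cost on the first call can be told apart from the per-call cost. It is not my actual code: `time_ms`, `time_twice` and `CallTimes` are hypothetical names, and it uses `std::chrono` instead of the cutil timers so that it builds without the SDK.

```cpp
#include <chrono>

// Hypothetical helper: wall-clock time of a single call to f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical result type: timings of two consecutive calls.
struct CallTimes {
    double first_ms;
    double second_ms;
};

// Runs the same callable twice and reports each call's time separately.
template <typename F>
CallTimes time_twice(F&& f) {
    CallTimes t;
    t.first_ms = time_ms(f);   // includes any one-time setup cost
    t.second_ms = time_ms(f);  // should reflect the steady-state cost
    return t;
}
```

In my case the callable would wrap the allocation, for example `time_twice([&] { cudaMalloc(&d_ptr, bytes); })`, where `d_ptr` and `bytes` are placeholders. If only the first call came out slow, that would suggest the 200 to 300 ms is a one-time cost rather than the cost of the allocation itself.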