cudaMalloc taking 4 seconds

Hello,

I just started to analyse one of my CUDA programs with NVIDIA Nsight and noticed that cudaMalloc is taking 4 seconds to complete. So I started commenting out parts of the program to find exactly where the problem was, and I found that even a 2-line program takes roughly 1-4 seconds, regardless of how many times I call cudaMalloc or how much memory I allocate.

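#include <cuda_runtime.h>

int main(void)
{
    int *test;  // device pointer; cudaMalloc stores the allocated address in it
    cudaMalloc((void**)&test, sizeof(int));
    cudaFree(test);
    return 0;
}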

But then I noticed something even weirder: if I compile the program 2-3 times, cudaMalloc's time drops to about 0.1-0.4 seconds. But 0.1 seconds just to allocate an integer is still a long time.

If anyone has any advice, please share.

Thanks!

Have you initialized the device (e.g. cuInit(0)) before allocating?

No, I didn't have cuInit(0) in my program because I have never seen it used in any example. I added the line “cuInit(0);” to the beginning of my program and nothing changed.

I have also tried

"cudaSetDevice(0);

cudaThreadSynchronize();"

which shifts the 4 second overhead time to the cudaThreadSynchronize call.
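
The shift described above can be measured directly. The following is a minimal sketch, not a definitive benchmark: it assumes the delay comes from the one-time setup that the first CUDA runtime call triggers, and it uses the common cudaFree(0) idiom to force that setup before timing cudaMalloc separately. The names now_ms and d_test are made up for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time in milliseconds from a monotonic clock.
static double now_ms()
{
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double, std::milli>(
               clock::now().time_since_epoch()).count();
}

int main(void)
{
    // First runtime call: cudaFree(0) frees nothing, but it forces the
    // runtime to initialize, so the one-time cost is paid here.
    double t0 = now_ms();
    cudaError_t err = cudaFree(0);
    double t1 = now_ms();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Second runtime call: the allocation itself, timed on its own.
    int *d_test = nullptr;  // illustrative name for the device pointer
    err = cudaMalloc((void**)&d_test, sizeof(int));
    double t2 = now_ms();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("first call (cudaFree(0)): %.3f ms\n", t1 - t0);
    std::printf("cudaMalloc afterwards:    %.3f ms\n", t2 - t1);

    cudaFree(d_test);
    return 0;
}
```

If the first number carries nearly all of the delay and the second is small, the time is going into initialization rather than into the allocation.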

I have been searching all over the internet trying to figure out a fix and still no luck. I know that there are other people with this same problem, because I have found threads about it but no real answers.

This article really summarizes what I’m experiencing in the “warm up” part

http://ivanlife.wordpress.com/2011/05/09/how-to-make-good-measurements-in-nvida-cuda/

The author reports that there is an initialization overhead for CUDA of 3-5 seconds!!! Are you experiencing this? Because I’m calling shenanigans on there being a 3-5 second overhead for everyone.

Please help me out,

Thanks!

What OS and what GPU?

OS: Windows 7 Pro. 64bit

GPU: EVGA GTX 560TI 2GB

The device query output can be found in the attached file, DQ.PNG.