cudaMalloc taking 4 seconds


I just started analysing one of my CUDA programs with NVIDIA Nsight and noticed that cudaMalloc is taking 4 seconds to complete. So I started commenting out parts of the program to find exactly where the problem was, and I found that even for a 2-line program it was taking about 1-4 seconds, regardless of the number of times I call cudaMalloc or the size of the allocated memory.

int main(void) {
    int *test;
    cudaMalloc((void **)&test, sizeof(int));  /* allocate a single integer on the device */
}


But then I noticed something even weirder: if I compile the program 2-3 times, cudaMalloc’s time drops to about 0.1-0.4 seconds. But 0.1 seconds just to allocate an integer is still a long time.

If anyone has any advice, please share.
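One way to check whether the delay belongs to cudaMalloc itself or to one-time CUDA setup is to time two allocations back to back. The CUDA runtime creates its context lazily on the first call that needs it, so the first cudaMalloc would be expected to absorb that cost. The following is only a minimal sketch: the helper name `time_malloc_ms` is made up for illustration, and it assumes a toolchain with C++11 `<chrono>` support.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of a single cudaMalloc call, in ms.
// Returns a negative value if the allocation fails.
static double time_malloc_ms(size_t bytes) {
    void *ptr = nullptr;
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(ptr);  // released outside the timed interval
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    // The first call is expected to include runtime/context initialization.
    double first = time_malloc_ms(sizeof(int));
    // The second call should reflect the allocation cost alone.
    double second = time_malloc_ms(sizeof(int));
    std::printf("first cudaMalloc:  %.3f ms\n", first);
    std::printf("second cudaMalloc: %.3f ms\n", second);
    return 0;
}
```

If the second number is tiny compared with the first, the time is going into initialization rather than into the allocation.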


Have you initialized the device (e.g. cuInit(0)) before allocating?

No, I didn’t have cuInit(0) in my program because I have never seen it used in any example. So I added the line “cuInit(0);” to the beginning of my program and nothing changed.

I have also tried



which shifts the 4-second overhead to the cudaThreadSynchronize call.

I have been searching all over the internet trying to figure out a fix and still no luck. I know that there are other people with this same problem, because I have found threads about it but no real answers.

This article really summarizes what I’m experiencing in the “warm up” part

The author reports that there is an initialization overhead for CUDA of 3-5 seconds!!! Are you experiencing this? Because I’m calling shenanigans on there being a 3-5 second overhead for everyone.

Please help me out,
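For reference, the "warm up" idea can be sketched as paying the initialization cost explicitly before any timed work. A common idiom is a `cudaFree` on a null pointer, which does nothing except force the runtime to set up its context. This is a sketch under assumptions, not a fix for the 3-5 second figure itself: the helper name `warm_up_device` is hypothetical, and device 0 is assumed to be the GPU in use.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: force CUDA context creation on the given device.
// Returns true if both runtime calls succeed.
static bool warm_up_device(int device) {
    if (cudaSetDevice(device) != cudaSuccess) return false;
    // Freeing a null pointer is a no-op that still triggers lazy initialization.
    return cudaFree(nullptr) == cudaSuccess;
}

int main() {
    using clock = std::chrono::steady_clock;

    // Time the one-off initialization on its own.
    auto t0 = clock::now();
    if (!warm_up_device(0)) {  // device index 0 is an assumption
        std::fprintf(stderr, "CUDA warm-up failed\n");
        return 1;
    }
    auto t1 = clock::now();

    // Time the allocation after the context already exists.
    int *d_value = nullptr;
    cudaError_t err = cudaMalloc(&d_value, sizeof(int));
    auto t2 = clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_value);

    std::printf("initialization: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    std::printf("cudaMalloc:     %.3f ms\n",
                std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}
```

This does not make the startup cost go away; it only moves it to a known place so that profiler numbers for later calls are not distorted by it.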


What OS and what GPU?

OS: Windows 7 Pro, 64-bit


The device query can be found in the attached file