I just started analysing one of my CUDA programs with NVIDIA Nsight and noticed that cudaMalloc takes 4 seconds to complete. I commented out parts of the program to find exactly where the problem was, and found that even a two-line program takes roughly 1-4 seconds, regardless of how many times I call cudaMalloc or how much memory I allocate.
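#include <cuda_runtime.h>

int main(void)
{
    int *test;  /* must be a pointer: cudaMalloc stores the device address in it */
    cudaMalloc((void**)&test, sizeof(int));
    cudaFree(test);
    return 0;
}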
But then I noticed something even weirder: if I compile the program 2-3 times, cudaMalloc’s time shortens to about 0.1 - 0.4 seconds. Still, 0.1 seconds just to allocate one integer is a long time.
No, I didn't have cuInit(0) in my program, because I have never seen it used in any example. I added the line “cuInit(0);” to the beginning of my program and nothing changed.
I have also tried
cudaSetDevice(0);
cudaThreadSynchronize();
which shifts the 4 second overhead to the cudaThreadSynchronize call.
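The shift described above is consistent with the CUDA runtime creating its context lazily, on the first call that actually needs one, so whichever call comes first absorbs the one-time cost. Here is a minimal sketch of how that cost could be timed separately from the allocation itself. It assumes that cudaFree(0) is enough to trigger context creation (a commonly used warm-up idiom), and the helper ms_since is a hypothetical name introduced only for this example.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): wall-clock milliseconds since `start`.
static double ms_since(std::chrono::steady_clock::time_point start)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main(void)
{
    // Warm-up: cudaFree(0) frees nothing, but as the first runtime call
    // it is assumed here to force the one-time context initialization.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    double init_ms = ms_since(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "warm-up failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With the context already created, time the allocation on its own.
    int *d_value = nullptr;
    auto t1 = std::chrono::steady_clock::now();
    err = cudaMalloc((void**)&d_value, sizeof(int));
    double malloc_ms = ms_since(t1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("context init (first call): %.3f ms\n", init_ms);
    printf("cudaMalloc after warm-up:  %.3f ms\n", malloc_ms);

    cudaFree(d_value);
    return 0;
}
```

If the first number is large and the second is small, the delay belongs to initialization rather than to cudaMalloc. That would not remove the overhead, only show where it sits.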
I have been searching all over the internet trying to figure out a fix and still no luck. I know that there are other people with this same problem, because I have found threads about it but no real answers.
This article really summarizes what I’m experiencing in the “warm up” part
The guy reports that there is a CUDA initialization overhead of 3-5 seconds! Are you experiencing this? Because I’m calling shenanigans on the idea that there is a 3-5 second overhead for everyone.