CUDA startup is slow

I found that the first call into the CUDA library is very slow. My code looks like this:

cudaMallocPitch((void**)&d_limg, &pitch, wl*sizeof(float), hl);

float timerPitch = cutGetTimerValue(timer4);
printf("Malloc pitch time: %0.2f ms\n", timerPitch);

cudaMallocPitch((void**)&d_prelimg, &pitch, wl*sizeof(float), hl);
cudaMallocPitch((void**)&d_simg, &spitch, w*sizeof(float), h);

The first allocation takes 500 ms, but the next two calls don’t have this problem.

Any suggestions?


Can anyone help?
I am using an 8600GT, and the first cudaMallocPitch is the first call into the CUDA library.

After some googling, it seems that the first call into the CUDA library triggers initialization of the context.

But what is the scope of this initialization? Is it per host process, or per host thread if the program is multi-threaded?


Yes. It is only a one-time cost at the beginning of the process/thread.
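Since the cost is paid once, a common way to keep it out of your measurements is to force the initialization with a harmless call before starting the timer. Here is a minimal sketch of that idea, not the original poster's code: warm_up_context and timed_malloc_pitch are hypothetical helper names, the 640x480 image size is an assumption, and it uses std::chrono instead of the cutil timer from the question.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: cudaFree(0) frees nothing, but as a runtime API call
// it makes the runtime create its context right now.
static cudaError_t warm_up_context() {
    return cudaFree(0);
}

// Hypothetical helper: host wall-clock milliseconds spent in one
// cudaMallocPitch call. The error code is returned through *err.
static double timed_malloc_pitch(float** ptr, size_t* pitch,
                                 size_t width_bytes, size_t height,
                                 cudaError_t* err) {
    auto t0 = std::chrono::steady_clock::now();
    *err = cudaMallocPitch((void**)ptr, pitch, width_bytes, height);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t w = 640, h = 480;  // assumed image size, for illustration only

    // Pay the one-time initialization cost here, outside the timed region.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = warm_up_context();
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Context init time: %0.2f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    // Time two allocations; neither is the first CUDA call any more.
    float* d_a = 0;
    float* d_b = 0;
    size_t pitch_a = 0, pitch_b = 0;
    double ms_a = timed_malloc_pitch(&d_a, &pitch_a, w * sizeof(float), h, &err);
    if (err != cudaSuccess) {
        fprintf(stderr, "alloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    double ms_b = timed_malloc_pitch(&d_b, &pitch_b, w * sizeof(float), h, &err);
    if (err != cudaSuccess) {
        fprintf(stderr, "alloc failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_a);
        return 1;
    }
    printf("Malloc pitch time (1st): %0.2f ms\n", ms_a);
    printf("Malloc pitch time (2nd): %0.2f ms\n", ms_b);

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

This only moves the startup cost to a place where it does not distort the timing of the allocation itself; the total run time of the program stays the same.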

You may be able to reduce this initialization time significantly by telling nvcc which GPU your kernel will run on. To do so, add -code sm_13 (or whatever matches your GPU) to nvcc’s command line. For details, see e.g. page 16 of nvcc_2.1.pdf in the doc directory next to nvcc’s bin directory.

If you don’t specify -code, apparently something like ptxas is invoked when the first CUDA function executes, in order to compile and optimize the PTX embedded in your executable for the current GPU. (I just figured this out with a rather large kernel: without -code, the executable aborts after about 3.5 minutes, all spent in the first CUDA function. With -code, compilation takes about 4.5 minutes, but the executable initializes within a few seconds…)
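If you are not sure which value matches your GPU, you can ask the runtime for the device's compute capability and build the sm_XY name from it. This is a rough sketch under a few assumptions: it only looks at device 0, and format_sm is a hypothetical helper name. Whether your nvcc version accepts the resulting name, and whether it also wants a matching -arch option, is something to check in the nvcc documentation that ships with your toolkit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turns a compute capability such as 1.3 into the
// string "sm_13", the form used with nvcc's -code option.
static void format_sm(int major, int minor, char* out, size_t n) {
    snprintf(out, n, "sm_%d%d", major, minor);
}

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    char sm[16];
    format_sm(prop.major, prop.minor, sm, sizeof(sm));
    printf("%s: compute capability %d.%d, try -code %s\n",
           prop.name, prop.major, prop.minor, sm);
    return 0;
}
```

An executable built with the matching -code value carries machine code for that GPU, so the first CUDA call should not need to compile the embedded PTX at startup.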