Help! First cudaMalloc takes 10 seconds!


I have seen some discussions about CUDA startup cost, but I just found that my first cudaMalloc call takes a strange 10 seconds.

Even in the simplest CUDA program, the first call to cudaMalloc takes almost 10 seconds. The server is a Linux x86_64 Intel Xeon system with 4 Tesla C2070 GPUs. The CUDA toolkit is release 4.0, V0.2.1221. Following some posts here, I tried adding -code=sm_13 or sm_20 to the nvcc command line, but neither reduced this overhead.

I tried the same code on a second Linux x86_64 server with 4 Tesla C2050 GPUs. There the first cudaMalloc call takes 3-4 seconds.
But on a third server, a somewhat older Linux i386 machine with one Quadro 600 GPU, the first cudaMalloc call takes only around 0.06 seconds.

All of the above tests were run more than 5 times. Any suggestions?



It would be interesting to repeat these experiments with the CUDA Driver API. Please also post the code you are testing so that we can suggest something more.
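A minimal sketch of that Driver API experiment, timing initialization, context creation, and allocation separately. It assumes device 0, a POSIX system (for gettimeofday), and the three-argument cuCtxCreate signature from the CUDA 4.x-era toolkits; newer toolkits may declare it differently. The helper wall_seconds and the CHECK macro are illustrative names, not part of CUDA.

```cuda
// Build (assumed file name): nvcc driver_timing.cu -lcuda
#include <cstdio>
#include <sys/time.h>
#include <cuda.h>

// Wall-clock time in seconds (hypothetical helper based on gettimeofday).
static double wall_seconds() {
    timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Print the numeric error code and exit main if a Driver API call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        CUresult r = (call);                                            \
        if (r != CUDA_SUCCESS) {                                        \
            fprintf(stderr, "%s failed with code %d\n", #call, (int)r); \
            return 1;                                                   \
        }                                                               \
    } while (0)

int main() {
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr ptr;

    double t0 = wall_seconds();
    CHECK(cuInit(0));                  // initialize the Driver API
    double t1 = wall_seconds();
    CHECK(cuDeviceGet(&dev, 0));       // device 0 is an assumption
    CHECK(cuCtxCreate(&ctx, 0, dev));  // create the context explicitly
    double t2 = wall_seconds();
    CHECK(cuMemAlloc(&ptr, 2 * sizeof(double)));
    double t3 = wall_seconds();

    printf("cuInit:                    %f s\n", t1 - t0);
    printf("cuDeviceGet + cuCtxCreate: %f s\n", t2 - t1);
    printf("cuMemAlloc:                %f s\n", t3 - t2);

    CHECK(cuMemFree(ptr));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

If most of the time lands in cuInit or cuCtxCreate rather than cuMemAlloc, the delay is a one-time setup cost and not the allocation itself.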

Thanks for the reply.

The code is about as simple as it gets:

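#include <stdio.h>
#include <ctime>
#include <cuda_runtime.h>

int main(void) {
        double * a;

        // clock() reports CPU time used by the process, not wall-clock time
        clock_t t11 = clock();
        cudaMalloc((void**)&a, 2*sizeof(double));  // first CUDA call in the program
        clock_t t12 = clock();

        printf(" cudaMalloc time: %f s\n", ((double)t12 - (double)t11)/CLOCKS_PER_SEC );

        return 0;
}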

The results on the three platforms:

Platform1(Linux-x86_64,Tesla C2070): 9.47 s

Platform2(Linux-x86_64,Tesla C2050): 1.16 s

Platform3(Linux-i386, Quadro 600): 0.07 s

Try upgrading/downgrading the driver and see if that changes the startup time.

FYI: In CUDA code, events are usually used for any kind of performance testing (CUDA C Programming Guide, p. 35, chapter

10 seconds sounds insane… It should not be that bad…

That is because the driver is not loaded and that time is required by the OS to load it.

To test this, write a program that does a cudaMalloc and then sleeps for hours. While it is sleeping (which keeps the driver loaded), run your test again and see how long a cudaMalloc takes.
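A related check can be done inside a single process: force initialization with a dummy call, then time the allocations on their own. This is only a sketch. It assumes a POSIX system for gettimeofday, relies on the common cudaFree(0) idiom to trigger runtime and context setup, and wall_seconds is an illustrative helper name.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock time in seconds (hypothetical helper based on gettimeofday).
static double wall_seconds() {
    timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main() {
    double *a = NULL, *b = NULL;

    double t0 = wall_seconds();
    cudaError_t err = cudaFree(0);  // frees nothing; used here only to force initialization
    double t1 = wall_seconds();
    if (err != cudaSuccess) {
        fprintf(stderr, "initialization failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaMalloc((void**)&a, 2 * sizeof(double));  // first real allocation
    double t2 = wall_seconds();
    if (err == cudaSuccess)
        err = cudaMalloc((void**)&b, 2 * sizeof(double));  // second allocation
    double t3 = wall_seconds();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("initialization (cudaFree(0)): %f s\n", t1 - t0);
    printf("first cudaMalloc:             %f s\n", t2 - t1);
    printf("second cudaMalloc:            %f s\n", t3 - t2);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the first line accounts for nearly all of the delay, the cost belongs to startup, which is consistent with the driver-loading explanation above.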

Probably some other processes were running on the system when you measured the time. For me, the first run of your cudaMalloc code on Linux x86_64 + Tesla C2070 gives 0.05 s.

Hi, I have a similar problem. I want to do some real-time image processing at about 30 frames/second.

But just copying 0.5 MBytes between CPU and GPU takes me 0.06 s. Isn't that too slow? I am using a Quadro 420.
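At 30 frames/second the budget per frame is about 33 ms, so a 60 ms copy would not fit. Before concluding that the copy itself is that slow, it may help to time it in isolation with CUDA events, after a warm-up, so that one-time initialization is not counted. The sketch below compares pageable and pinned host memory for a 0.5 MB host-to-device copy; time_h2d_ms is an illustrative helper name, and the buffer contents are dummy zeros.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time one host-to-device copy with CUDA events; returns milliseconds.
static float time_h2d_ms(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 512 * 1024;  // 0.5 MB, as in the post
    void *d_buf = NULL, *h_pinned = NULL;
    void *h_pageable = malloc(bytes);
    if (h_pageable == NULL) return 1;

    // The first CUDA call also pays the one-time initialization cost.
    cudaError_t err = cudaMalloc(&d_buf, bytes);
    if (err == cudaSuccess)
        err = cudaMallocHost(&h_pinned, bytes);  // page-locked host buffer
    if (err != cudaSuccess) {
        fprintf(stderr, "allocation failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    memset(h_pageable, 0, bytes);
    memset(h_pinned, 0, bytes);

    time_h2d_ms(d_buf, h_pageable, bytes);  // warm-up copy, result discarded

    printf("pageable: %f ms\n", time_h2d_ms(d_buf, h_pageable, bytes));
    printf("pinned:   %f ms\n", time_h2d_ms(d_buf, h_pinned, bytes));

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    free(h_pageable);
    return 0;
}
```

If the warmed-up copy turns out to be far below 60 ms, the original measurement probably included startup overhead rather than transfer time alone.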