I have seen some discussion of CUDA function startup cost, but I just ran into a strange 10-second delay on the first cudaMalloc call.
Even in the simplest CUDA program, the first call to cudaMalloc takes almost 10 seconds. The server is a Linux x86-64 Intel Xeon system with 4 Tesla C2070 GPUs. The CUDA version is release 4.0, V0.2.1221. Following some posts here, I tried adding -code=sm_13 or sm_20 to the nvcc command line, but neither reduced the overhead.
I ran the same code on a second Linux x86_64 server with 4 Tesla C2050 GPUs. There, the first cudaMalloc call takes 3-4 seconds.
On a third, somewhat older Linux i386 server with a single Quadro 600 GPU, however, the first cudaMalloc call takes only about 0.06 seconds.
All of the above tests were run more than 5 times. Any suggestions?
Probably some other processes were running on the system when you measured the time. For me, the first run of your cudaMalloc code on Linux x86_64 with a Tesla C2070 takes 0.05 sec.
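To narrow down where the time goes, it may help to time runtime initialization separately from the allocation itself. The CUDA runtime sets up its context lazily on the first call that needs one, so the "first cudaMalloc" number usually includes that setup. Below is a minimal sketch, not the original test program: it assumes Linux (for gettimeofday), uses cudaFree(0) as a common way to force initialization, and the 1 MB allocation size and the helper name now_sec are arbitrary choices for illustration.

```cuda
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

/* Hypothetical helper: wall-clock time in seconds (Linux gettimeofday). */
static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

int main(void) {
    void *a = NULL;
    void *b = NULL;
    const size_t bytes = 1 << 20;  /* 1 MB, arbitrary size for the test */

    /* Step 1: force runtime/context initialization without allocating. */
    double t0 = now_sec();
    cudaError_t e0 = cudaFree(0);
    double t1 = now_sec();

    /* Step 2: first real allocation, now that the context exists. */
    cudaError_t e1 = cudaMalloc(&a, bytes);
    double t2 = now_sec();

    /* Step 3: second allocation, for comparison. */
    cudaError_t e2 = cudaMalloc(&b, bytes);
    double t3 = now_sec();

    if (e0 != cudaSuccess || e1 != cudaSuccess || e2 != cudaSuccess) {
        cudaError_t err = (e0 != cudaSuccess) ? e0 : (e1 != cudaSuccess) ? e1 : e2;
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("init (cudaFree(0)): %.4f s\n", t1 - t0);
    printf("first cudaMalloc:   %.4f s\n", t2 - t1);
    printf("second cudaMalloc:  %.4f s\n", t3 - t2);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If nearly all of the delay shows up in the first line of output, the cost is in driver/context initialization rather than in cudaMalloc itself, and comparing that number across the three machines should show whether it tracks the GPU count or something else on the system.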