Help! First cudaMalloc takes 10 seconds!

xiansl · January 24, 2012, 11:19pm

Hi,

I have seen some discussions about CUDA startup cost, but I just ran into a strange 10-second delay on the first cudaMalloc call.

Even in the simplest CUDA program, the first call to “cudaMalloc” takes almost 10 seconds. The server is a Linux x86_64 Intel Xeon system with 4 Tesla C2070 GPUs. CUDA is release 4.0, V0.2.1221. Following some posts here, I tried adding -code=sm_13 or sm_20 to the nvcc command line, but neither reduced this overhead.

I tried the same code on a second Linux x86_64 server with 4 Tesla C2050 GPUs. There the first cudaMalloc call takes 3-4 seconds.
On a third, somewhat older Linux i386 server with one Quadro 600 GPU, the first cudaMalloc call takes only around 0.06 seconds.

All of the above tests were repeated more than 5 times. Any suggestions?

Thanks.

Lei

short · January 25, 2012, 2:32am

It would be interesting to repeat these experiments with the CUDA Driver API. Please also post the code you are testing so that we can suggest something more.

xiansl · January 25, 2012, 5:53am

Thanks for the reply.

The code is about as simple as it gets:

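#include <stdio.h>
#include <ctime>
#include <cuda_runtime.h>  // nvcc includes this implicitly for .cu files; explicit here for clarity

int main(void) {
        double * a;

        // clock() reports processor time used by this process, not wall-clock time.
        clock_t t11 = clock();
        // The first runtime API call also pays for CUDA initialization.
        cudaMalloc((void**)&a, 2*sizeof(double));
        clock_t t12 = clock();

        printf(" cudaMalloc time: %f s\n", ((double)t12 - (double)t11)/CLOCKS_PER_SEC );

        cudaFree(a);
        return 0;
}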

Results on the three platforms:

Platform1(Linux-x86_64,Tesla C2070): 9.47 s

Platform2(Linux-x86_64,Tesla C2050): 1.16 s

Platform3(Linux-i386, Quadro 600): 0.07 s
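
One way to narrow down numbers like these is to time the one-time runtime initialization separately from the allocation itself. The following is a minimal sketch, not the poster's code. It assumes a C++11-capable nvcc (newer than the CUDA 4.0 toolchain mentioned above), uses cudaFree(0) as a common idiom for forcing initialization, and measures wall-clock time with std::chrono::steady_clock rather than clock(). The helper name seconds_since is made up for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

typedef std::chrono::steady_clock wall_clock;

// Hypothetical helper: wall-clock seconds elapsed since `start`.
static double seconds_since(wall_clock::time_point start) {
    return std::chrono::duration<double>(wall_clock::now() - start).count();
}

int main() {
    // First runtime call: this is where one-time initialization is paid.
    wall_clock::time_point t0 = wall_clock::now();
    cudaError_t err = cudaFree(0);
    double init_s = seconds_since(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "initialization failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Second runtime call: initialization is done, so this is closer to
    // the cost of the allocation alone.
    double *a = nullptr;
    wall_clock::time_point t1 = wall_clock::now();
    err = cudaMalloc((void **)&a, 2 * sizeof(double));
    double malloc_s = seconds_since(t1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("initialization: %f s\n", init_s);
    printf("cudaMalloc:     %f s\n", malloc_s);

    cudaFree(a);
    return 0;
}
```

If nearly all of the time lands in the first line of output, the delay is start-up cost rather than anything specific to cudaMalloc.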

short · January 25, 2012, 7:42pm

Try upgrading/downgrading the driver and see if that changes the startup time.

cmaster.matso · January 26, 2012, 12:03pm

FYI: In CUDA code, events are usually used for performance timing (CUDA C Programming Guide, p. 35, chapter 3.2.5.6).
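
For reference, event-based timing looks roughly like the sketch below. It times a small kernel, since events are recorded on the GPU's stream. The kernel scale, the CHECK macro, and the problem size are all invented for this example. Note the assumption that creating an event already requires an initialized runtime, so events are suited to timing work after start-up, not the start-up cost itself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for this sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Illustrative kernel: multiply every element by `factor`.
__global__ void scale(float *x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1 << 20;  // arbitrary size for the example
    float *d_x = nullptr;
    CHECK(cudaMalloc((void **)&d_x, n * sizeof(float)));
    CHECK(cudaMemset(d_x, 0, n * sizeof(float)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    CHECK(cudaEventRecord(start, 0));       // enqueue "start" marker
    scale<<<blocks, threads>>>(d_x, n, 2.0f);
    CHECK(cudaGetLastError());              // catch launch errors
    CHECK(cudaEventRecord(stop, 0));        // enqueue "stop" marker
    CHECK(cudaEventSynchronize(stop));      // wait until "stop" is reached

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("kernel time: %f ms\n", ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_x));
    return 0;
}
```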

Sarnath · January 28, 2012, 11:46am

10 seconds sounds insane…It should not be that bad…

kalman · February 8, 2012, 9:37pm

That is because the driver is not loaded, and that time is what the OS needs to load it.

To test this, write a program that does a cudaMalloc and then sleeps for hours. While it is sleeping (which keeps the driver loaded), run your test again and see how long a cudaMalloc takes now.
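
The keep-alive experiment described above could look something like this. It is a minimal sketch assuming a POSIX system (for sleep()), and it uses cudaFree(0) instead of a real allocation to initialize the runtime. Whether this actually removes the delay on a given machine is exactly what the experiment is meant to find out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <unistd.h>  // sleep(), POSIX only

int main() {
    // Initialize the CUDA runtime so this process holds the device open.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "initialization failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("CUDA initialized; sleeping. Stop with Ctrl-C.\n");
    fflush(stdout);

    // Sleep indefinitely, one hour at a time, keeping the process alive.
    for (;;) {
        sleep(3600);
    }
}
```

Run this in one terminal, then run the timing test in another and compare against the earlier numbers. On Linux, the driver's persistence mode (nvidia-smi -pm 1, where supported) is aimed at the same effect without a helper process.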

L_F · February 8, 2012, 9:55pm

Probably some other processes were running on the system when you measured the time. For me, the first run of your cudaMalloc code on Linux x86_64 + Tesla C2070 gives 0.05 s.

baoyun · February 11, 2012, 8:04am

Hi, I have a similar problem. I want to do some real-time image processing at about 30 frames/second.

But just copying 0.5 MB between the CPU and GPU takes 0.06 s. Isn't that too slow? I am using a Quadro 420.
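
For a small transfer like this, two things are worth ruling out before concluding that the copy itself is slow: whether the measurement includes one-time start-up cost, and whether the host buffer is pageable or page-locked (pinned). The sketch below times a 0.5 MB host-to-device copy both ways after a warm-up copy. The helper time_copy_ms and the CHECK macro are hypothetical names, and the numbers it prints will depend entirely on the machine.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for this sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: time one host-to-device copy, in milliseconds.
static float time_copy_ms(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 512 * 1024;  // 0.5 MB, as in the post above

    void *d_buf = nullptr;
    CHECK(cudaMalloc(&d_buf, bytes));

    // Ordinary (pageable) host memory.
    void *pageable = malloc(bytes);
    if (pageable == nullptr) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    memset(pageable, 0, bytes);

    // Page-locked (pinned) host memory allocated through the runtime.
    void *pinned = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    memset(pinned, 0, bytes);

    // Warm-up copy so one-time costs are not attributed to either case.
    time_copy_ms(d_buf, pageable, bytes);

    float pageable_ms = time_copy_ms(d_buf, pageable, bytes);
    float pinned_ms = time_copy_ms(d_buf, pinned, bytes);

    printf("pageable copy: %f ms\n", pageable_ms);
    printf("pinned copy:   %f ms\n", pinned_ms);

    CHECK(cudaFreeHost(pinned));
    free(pageable);
    CHECK(cudaFree(d_buf));
    return 0;
}
```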
