cudaMalloc calls taking different amounts of time

Dear all,

I am using cudaMalloc to allocate device memory and calculating the timings in the following way:

    clock_t start = clock();

    cutilSafeCall(cudaMalloc((void**) &sumImg_d, sizeof(int)*W*H));                         // 1st call

    tGPU = tGPU + ((double)clock() - start) / CLOCKS_PER_SEC;

    cutilSafeCall(cudaMalloc((void**) &skinIntData_d, sizeof(int)*W*H));                    // 2nd call

    cutilSafeCall(cudaMalloc((void**) &out_d, sizeof(bool)*(W-dw)*(H-dh)));                 // 3rd call

    cutilSafeCall(cudaMemcpy(sumImg_d, sumImg, numImgBytes, cudaMemcpyHostToDevice));       // 4th call

    cutilSafeCall(cudaMemcpy(skinIntData_d, skinIntData, numImgBytes, cudaMemcpyHostToDevice)); // 5th call

Now, for every frame, the first cudaMalloc takes around 30 ms while the second one takes only around 8 ms (these are the total timings, not the time for a single call of the function). Both allocations are the same size, yet the first one takes longer. Is anything wrong? I heard that it could be because of the context, but I am not sure that is the problem. The third cudaMalloc also takes around 35 ms, and I am not sure what this time depends on. Is it because I am allocating bool?

I am using time.h to get the timings: I call clock() at the start and end of each function to calculate the total time. Also, is it necessary to call cudaThreadSynchronize() to get correct timings in this case? As far as I know these functions are not asynchronous, so I think it should be OK to leave it out. Kindly point out if this is not correct.


I’m not sure I’d trust clock() to have that sort of time resolution. For millisecond resolution, use gettimeofday; if you’re still using cutil, the cutil timers should be a convenient wrapper. In general, though, the first cuda* call will be slow, since the driver has to initialise a GPU context (the exceptions are the device query routines).
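To separate the one-time context cost from the allocation itself, something like the following could be tried. It is only a rough sketch: the helper names timedMallocMs and reportMallocTimings are made up for illustration, std::chrono::steady_clock stands in for gettimeofday as the wall-clock source, and cudaFree(0) is used as the usual idiom for forcing context creation before anything is timed.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock milliseconds spent in one cudaMalloc.
// Returns a negative value if the allocation fails.
static double timedMallocMs(void **ptr, size_t bytes)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(ptr, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: times context creation on its own, then two
// allocations of the same size, and prints all three numbers.
static void reportMallocTimings(int W, int H)
{
    // Force the driver to create the context before timing any allocation.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    double ctxMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    size_t bytes = sizeof(int) * (size_t)W * (size_t)H;
    int *a = nullptr;
    int *b = nullptr;
    double firstMs  = timedMallocMs((void **)&a, bytes);
    double secondMs = timedMallocMs((void **)&b, bytes);

    std::printf("context init: %.3f ms\n", ctxMs);
    std::printf("1st cudaMalloc: %.3f ms\n", firstMs);
    std::printf("2nd cudaMalloc: %.3f ms\n", secondMs);

    // cudaFree on a null pointer is a no-op, so this is safe after a failure.
    cudaFree(a);
    cudaFree(b);
}
```

Calling reportMallocTimings(W, H) once at startup should show whether the extra time on the first allocation is really context initialisation or something else.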

Thanks for the reply.

I also tried the cutil timers (cutCreateTimer etc.), but the timings are still the same. I am working on video and I launch a GPU kernel every frame. If the first cudaMalloc takes this long every time I call it, the performance of my implementation will be very bad. Is there any way to avoid this?


Call cudaMalloc once at the start of the program and reuse the buffers for every frame.
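A minimal sketch of that pattern, reusing the buffer names from the original post. The FrameBuffers struct and the three functions are hypothetical names, and the sketch assumes numImgBytes equals sizeof(int)*W*H and that the frame size does not change during the run.

```cuda
#include <cuda_runtime.h>

// Hypothetical container for the device buffers that live for the whole run.
struct FrameBuffers {
    int   *sumImg_d      = nullptr;
    int   *skinIntData_d = nullptr;
    bool  *out_d         = nullptr;
    size_t imgBytes      = 0;   // assumed equal to numImgBytes in the post
};

// Call once at program start. On failure the caller should still call
// freeFrameBuffers to release whatever was allocated before the error.
static cudaError_t allocFrameBuffers(FrameBuffers &fb, int W, int H, int dw, int dh)
{
    fb.imgBytes = sizeof(int) * (size_t)W * (size_t)H;
    size_t outBytes = sizeof(bool) * (size_t)(W - dw) * (size_t)(H - dh);

    cudaError_t err = cudaMalloc((void **)&fb.sumImg_d, fb.imgBytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc((void **)&fb.skinIntData_d, fb.imgBytes);
    if (err != cudaSuccess) return err;
    return cudaMalloc((void **)&fb.out_d, outBytes);
}

// Call once per frame: only the copies happen here, no allocation.
static cudaError_t uploadFrame(const FrameBuffers &fb,
                               const int *sumImg, const int *skinIntData)
{
    cudaError_t err = cudaMemcpy(fb.sumImg_d, sumImg, fb.imgBytes,
                                 cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;
    return cudaMemcpy(fb.skinIntData_d, skinIntData, fb.imgBytes,
                      cudaMemcpyHostToDevice);
}

// Call once at program exit. cudaFree on a null pointer is a no-op.
static void freeFrameBuffers(FrameBuffers &fb)
{
    cudaFree(fb.sumImg_d);
    cudaFree(fb.skinIntData_d);
    cudaFree(fb.out_d);
    fb = FrameBuffers();
}
```

The per-frame loop then calls only uploadFrame and the kernel, so the allocation cost (and the context creation that comes with the first CUDA call) is paid once rather than on every frame.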