I am using cudaMalloc to allocate device memory and calculating the timings in the following way:
clock_t start = clock();
cutilSafeCall(cudaMalloc((void**) &sumImg_d, sizeof(int)*W*H)); // 1st call
tGPU = tGPU + ((double)clock() - start) / CLOCKS_PER_SEC;
cutilSafeCall(cudaMalloc((void**) &skinIntData_d, sizeof(int)*W*H)); // 2nd call
cutilSafeCall(cudaMalloc((void**) &out_d, sizeof(bool)*(W-dw)*(H-dh))); // 3rd call
cutilSafeCall(cudaMemcpy(sumImg_d, sumImg, numImgBytes, cudaMemcpyHostToDevice)); // 4th call
cutilSafeCall(cudaMemcpy(skinIntData_d, skinIntData, numImgBytes, cudaMemcpyHostToDevice)); // 5th call
For every frame, the first cudaMalloc takes around 30 ms while the second takes only around 8 ms (these are total timings, not the time for a single call of the function). Both allocations are the same size, yet the first one takes longer. Is anything wrong? I have heard that this could be caused by CUDA context creation, but I am not sure that is the problem. The third cudaMalloc also takes around 35 ms, and I am not sure what this time depends on. Is it because I am allocating bool?
I am using time.h to get the timings: I call clock() at the start and end of each function to calculate the total time. Also, is it necessary to call cudaThreadSynchronize() to get correct timings in this case? As far as I know these functions are not asynchronous, so I think it should be fine to leave out cudaThreadSynchronize(). Kindly point out if this is not correct.
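To check whether context creation is what inflates the first call, one experiment would be to force the context up front and then time each allocation on its own. The following is only a minimal sketch under several assumptions: the values of W, H, dw and dh are placeholders, timedMallocMs is a hypothetical helper that is not part of the original code, cudaFree(0) is used as a commonly cited way to trigger context creation, and std::chrono::steady_clock stands in for clock() because clock() reports processor time rather than elapsed time on many platforms.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original code): returns the wall-clock
// time in milliseconds taken by a single cudaMalloc call, or -1.0 on failure.
static double timedMallocMs(void** ptr, size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(ptr, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // Placeholder sizes; the real values are not given in the post.
    const int W = 640, H = 480, dw = 24, dh = 24;

    int*  sumImg_d      = nullptr;
    int*  skinIntData_d = nullptr;
    bool* out_d         = nullptr;

    // Warm-up: the first runtime API call pays for context setup, so it is
    // timed separately here instead of being folded into the first cudaMalloc.
    auto w0 = std::chrono::steady_clock::now();
    cudaFree(0);
    auto w1 = std::chrono::steady_clock::now();
    printf("warm-up (context setup): %.3f ms\n",
           std::chrono::duration<double, std::milli>(w1 - w0).count());

    // Time each allocation individually so the three calls can be compared.
    printf("1st cudaMalloc: %.3f ms\n",
           timedMallocMs((void**)&sumImg_d, sizeof(int) * W * H));
    printf("2nd cudaMalloc: %.3f ms\n",
           timedMallocMs((void**)&skinIntData_d, sizeof(int) * W * H));
    printf("3rd cudaMalloc: %.3f ms\n",
           timedMallocMs((void**)&out_d, sizeof(bool) * (W - dw) * (H - dh)));

    // Release the buffers; cudaFree accepts a null pointer if a malloc failed.
    cudaFree(sumImg_d);
    cudaFree(skinIntData_d);
    cudaFree(out_d);
    return 0;
}
```

If the warm-up line absorbs most of the cost and the three allocations then report similar times, the difference seen above would point to one-time setup rather than to the buffer size or the bool element type. In a per-frame loop, allocating the buffers once outside the loop and reusing them would avoid paying for cudaMalloc on every frame at all.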