Hello,
Below is the structure of my CUDA console application:
int main()
{
    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer(&timer) ); // the timer has to be created before it is started
    CUT_SAFE_CALL( cutStartTimer(timer) );
    GPUKernalCalls( ); // 26 kernel calls in total + one host-to-device memcpy + one device-to-host memcpy
    CUT_SAFE_CALL( cutStopTimer(timer) );
    float exeTime = cutGetTimerValue(timer); // elapsed time in ms
    CUT_SAFE_CALL( cutDeleteTimer(timer) );
    printf("\n Project execution time: %.2f ms \n\n", exeTime );
    return 0;
}
When I profile my application in the CUDA 2.2 profiler, it reports 46 ms (the sum of the execution times of the individual kernels).
But the time measured in the console application ("exeTime" in the code above) is 85 ms, for an 800 x 600 input image and a 1920 x 1080 output image.
So the difference is 39 ms.
One could say that this difference comes from cudaMalloc(), cudaMemcpy(), cudaThreadSynchronize(), function call overhead, kernel launches, texture binding and unbinding, etc.
But none of these calls should take much time.
So where does this difference come from? Is it acceptable, or am I missing something in the way I measure the time in my code?
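
To narrow down where the extra 39 ms goes, the phases could be timed separately with CUDA events. The following is only a rough sketch, not my real code: dummyKernel, PhaseTimes, runTimedPhases and the buffer size are hypothetical stand-ins, and the cudaFree(0) warm-up is there on the assumption that the first runtime call pays for context creation, which the profiler's per-kernel sum would not include.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the 26 real kernels.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Per-phase timings in milliseconds (hypothetical helper type).
struct PhaseTimes { float h2d, kernel, d2h; };

// Times upload, kernel and download separately.
// Returns false if allocation fails or a CUDA error was reported.
bool runTimedPhases(PhaseTimes *t)
{
    const int n = 1920 * 1080;              // assumed element count
    const size_t bytes = n * sizeof(float);

    cudaFree(0);  // warm-up: forces context creation outside the timed region

    float *h = (float *)malloc(bytes);
    if (!h) return false;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { free(h); return false; }

    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i) cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0], 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // upload
    cudaEventRecord(ev[1], 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);       // kernel work
    cudaEventRecord(ev[2], 0);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // download
    cudaEventRecord(ev[3], 0);
    cudaEventSynchronize(ev[3]);  // wait until everything recorded has finished

    cudaEventElapsedTime(&t->h2d,    ev[0], ev[1]);
    cudaEventElapsedTime(&t->kernel, ev[1], ev[2]);
    cudaEventElapsedTime(&t->d2h,    ev[2], ev[3]);

    bool ok = (cudaGetLastError() == cudaSuccess);

    for (int i = 0; i < 4; ++i) cudaEventDestroy(ev[i]);
    cudaFree(d);
    free(h);
    return ok;
}

int main()
{
    PhaseTimes t = {0.0f, 0.0f, 0.0f};
    if (!runTimedPhases(&t)) {
        printf("CUDA error while timing\n");
        return 1;
    }
    printf("H2D copy: %.2f ms, kernel: %.2f ms, D2H copy: %.2f ms\n",
           t.h2d, t.kernel, t.d2h);
    return 0;
}
```

If the per-phase numbers add up to roughly the host timer value once the warm-up call is in place, the gap is probably the two memory copies plus one-time initialization rather than a mistake in the timing code.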