Execution time differs between the profiler and the console. Why?


Below is the structure of my CUDA console application…

unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer(&timer) );
CUT_SAFE_CALL( cutStartTimer(timer) );

GPUKernalCalls( );   // 26 kernel calls in total, plus one host-to-device and one device-to-host memcpy.

CUT_SAFE_CALL( cutStopTimer(timer) );
float exeTime = cutGetTimerValue(timer);
CUT_SAFE_CALL( cutDeleteTimer( timer) );
printf("\n Project execution time: %.2f ms \n\n", exeTime );


When I profile the application in the CUDA 2.2 profiler, it reports 46 ms (the sum of the individual kernel execution times), but the time measured in the console application ("exeTime" in the code above) is 85 ms, for an 800 x 600 input image and a 1920 x 1080 output image.
So the difference is 39 ms.

One could say this difference comes from cudaMalloc(), cudaMemcpy(), cudaThreadSynchronize(), function-call overhead, kernel launching, texture binding and unbinding, etc.,
but I would not expect all of those calls to take that much time.

So where does this difference come from? Is it acceptable, or am I missing something in the time calculation in my code?
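One way to narrow the gap down is to time the GPU work with CUDA events instead of a host-side cutil timer: events are recorded in the GPU command stream, so the elapsed time covers only the device work queued between them, not host-side setup. A minimal sketch, assuming `GPUKernalCalls()` is the function from the post above:

[codebox]
// Sketch: time the GPU work with cudaEvent_t instead of the cutil host timer.
// cudaEventElapsedTime() reports the time between the two recorded events,
// so host-side overhead before/after the launches is excluded.
extern void GPUKernalCalls();   // the poster's function: 26 kernels + 2 memcpys

void timeWithEvents()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    GPUKernalCalls();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          // block until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time (events): %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
[/codebox]

If this number comes out close to the profiler's 46 ms while the cutil number stays near 85 ms, the missing time is host-side overhead (allocation, binding, launch setup) rather than kernel work.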

Why do you say they will not take much time? cudaMalloc, cudaFree, and cudaMemcpy can take a significant amount of time.

Generally, how long do cudaMalloc, cudaFree, and cudaMemcpy take? And also

cudaBindTexture and cudaUnbindTexture?
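There is no single answer: these calls vary with platform, driver, and allocation size, and the very first runtime call in a process also pays one-time context creation. One way to get concrete numbers on your own machine is to bracket each call with CUDA events. A rough sketch (the 2400 x 1800 size is my own assumption, matching the test program below):

[codebox]
// Sketch: measure each runtime call separately with CUDA events.
// Numbers vary a lot by platform and driver; a warm-up call is issued
// first so one-time context creation is not charged to cudaMalloc.
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>

static float elapsedMs( cudaEvent_t s, cudaEvent_t e )
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, s, e);
    return ms;
}

int main()
{
    const size_t bytes = 2400 * 1800 * sizeof(int);
    cudaFree(0);                         // warm-up: force context creation

    cudaEvent_t s, e;
    cudaEventCreate(&s);
    cudaEventCreate(&e);

    int* d = NULL;
    cudaEventRecord(s, 0); cudaMalloc((void**)&d, bytes);
    cudaEventRecord(e, 0); cudaEventSynchronize(e);
    printf("cudaMalloc:     %.3f ms\n", elapsedMs(s, e));

    int* h = (int*)malloc(bytes);
    cudaEventRecord(s, 0); cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e, 0); cudaEventSynchronize(e);
    printf("cudaMemcpy H2D: %.3f ms\n", elapsedMs(s, e));

    cudaEventRecord(s, 0); cudaFree(d);
    cudaEventRecord(e, 0); cudaEventSynchronize(e);
    printf("cudaFree:       %.3f ms\n", elapsedMs(s, e));

    free(h);
    cudaEventDestroy(s); cudaEventDestroy(e);
    return 0;
}
[/codebox]

For the purely host-side portion of calls like cudaMalloc, a host timer around the call can be more representative than events, which only see the GPU stream.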

Check this simple program…

It reports 6.7 ms as the execution time from the code, but 3.5 ms in the profiler…

Can you tell me where the remaining 3.2 ms are spent?

[codebox]#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil_inline.h>

// includes, kernels
#include <template_kernel.cu>

#define TIMEPARAMS  unsigned int timer   = 0; \
                    float        exeTime = 0.0f;

#define STARTTIME   CUT_SAFE_CALL( cutCreateTimer(&timer) ); \
                    CUT_SAFE_CALL( cutStartTimer(timer) );

#define STOPTIME    CUT_SAFE_CALL( cutStopTimer(timer) ); \
                    exeTime = cutGetTimerValue(timer); \
                    CUT_SAFE_CALL( cutDeleteTimer(timer) );

__global__ void TestKernel1( int* array1, int limit )
{
    int idx = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if ( idx < limit )
        array1[idx] = idx;
}

int main( int argc, char** argv )
{
    cutilDeviceInit(argc, argv);

    TIMEPARAMS

    int* dArray1 = NULL;
    int  width   = 2400;
    int  height  = 1800;

    cudaMalloc( (void**)&dArray1, sizeof(int)*width*height );
    cudaMemset( dArray1, 0, sizeof(int)*width*height );

    dim3 grid1( ((width*height)+255)/256, 1, 1 );
    dim3 block1( 256, 1, 1 );

    STARTTIME
    TestKernel1<<<grid1,block1>>>(dArray1, width*height);
    cudaThreadSynchronize();
    STOPTIME

    printf("\n TestKernel() exe time: %.2f <ms> \n", exeTime );

    cudaFree(dArray1);
    cutilExit(argc, argv);
    return 0;
}
[/codebox]
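One likely home for part of the missing 3.2 ms is one-time context creation and module loading: it is charged to the first CUDA call inside the timed region of the process, but it never shows up as kernel time in the profiler. A hedged sketch of warming up before the timed region, using the same names as the program above:

[codebox]
// Sketch: warm up the context before timing, so one-time driver/context
// initialization is not counted in exeTime. cudaFree(0) is a common idiom
// to force context creation without allocating anything.
cudaFree(0);
TestKernel1<<<grid1,block1>>>(dArray1, width*height);   // warm-up launch
cudaThreadSynchronize();

// ...now STARTTIME, the measured launch, and STOPTIME as before...
[/codebox]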



I’ve noticed a similar overhead/lag/timesink/mystery; see my post here: http://forums.nvidia.com/index.php?showtopic=102808. I thought it might be Windows-specific. I think I read somewhere that Windows has a much higher overhead compared to Linux, but unfortunately I cannot remember where, so maybe that’s not true. Any hint would be appreciated.

Thanks && kind regards