Getting different times for kernel execution

Hi Everyone,

I want to calculate each individual kernel's execution time, so I am timing it like the following code sample:

unsigned int funTime = 0;
float        time    = 0.0f;

CUT_SAFE_CALL( cutCreateTimer(&funTime) );
CUT_SAFE_CALL( cutStartTimer(funTime) );

MyKernel<<<12, 256>>>( inputArray, outputArray );
	cudaThreadSynchronize();

CUT_SAFE_CALL( cutStopTimer(funTime) );
time = cutGetTimerValue(funTime);
CUT_SAFE_CALL( cutDeleteTimer(funTime) );

printf("MyKernel() execution time: %.2f (ms)\n", time);

I am getting one time from this code (around 15 ms), but the CUDA Profiler 2.3 reports a noticeably different execution time for the same kernel.

Why am I getting this difference in execution time?

You should not be using CUTIL; it is meant only for SDK developers to create quick and clean examples.

Given that, I’m guessing that the profiler uses actual hardware timers (CUDA events) while the cutil timers use lower-resolution software timers. Also, I’m going to assume cutStopTimer does a cudaThreadSynchronize; otherwise you would just be timing the kernel launch overhead, since kernel launches are asynchronous.

Could you tell me the best way to time a kernel from the code?

If I am not wrong, cutStopTimer does not call any thread-synchronization code:


//! Stop the timer with name \a name. Does not reset.
//! @param name  name of the timer to stop
CUTBoolean
cutStopTimer( const unsigned int name)
{
	CUTBoolean retval = CUTTrue;

#ifdef _DEBUG
	try
	{
		StopWatch::get( name).stop();
	}
	catch( const std::exception& ex)
	{
		std::cerr << "WARNING: " << ex.what() << std::endl;
		retval = CUTFalse;
	}
#else
	StopWatch::get( name).stop();
#endif

	return retval;
}

The answer is: I don’t know. I know that does not help a lot, but it’ll save you the time of asking me this question.

Please don’t kick me too hard for this >.<

Yes, you are absolutely correct here. I have checked it.

Check out the section “Event” in the programming guide:

The following code sample creates two events:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);



These events can be used to time the code sample of the previous section the following way:

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i)
	cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
	                size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
	myKernel<<<100, 512, 0, stream[i]>>>(outputDev + i * size, inputDev + i * size, size);
for (int i = 0; i < 2; ++i)
	cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,
	                size, cudaMemcpyDeviceToHost, stream[i]);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

They are destroyed this way:

cudaEventDestroy(start);
cudaEventDestroy(stop);

It looks like the nbody SDK example project uses these event timers:

void runBenchmark(int iterations)
{
	// once without timing to prime the GPU
	// ... one untimed simulation step (omitted here) ...

	cutilSafeCall(cudaEventRecord(startEvent, 0));
	for (int i = 0; i < iterations; ++i)
	{
		// ... one timed simulation step (omitted here) ...
	}
	cutilSafeCall(cudaEventRecord(stopEvent, 0));
	cudaEventSynchronize(stopEvent);

	float milliseconds = 0;
	cutilSafeCall( cudaEventElapsedTime(&milliseconds, startEvent, stopEvent));

	double interactionsPerSecond = 0;
	double gflops = 0;
	computePerfStats(interactionsPerSecond, gflops, milliseconds, iterations);

	printf("%d bodies, total time for %d iterations: %0.3f ms\n",
	       numBodies, iterations, milliseconds);
	printf("= %0.3f billion interactions per second\n", interactionsPerSecond);
	printf("= %0.3f GFLOP/s at %d flops per interaction\n", gflops, 20);
}
Keep in mind, however, that these are just device timers and will only time what’s going on with the hardware: perfect if you are just trying to time a kernel.

You are correct, sorry about that.

I missed the indented “cudaThreadSynchronize();” after the call in the original post. I was assuming that, since it took 15 ms as opposed to the very short amount of time required to launch the kernel, something else (cutStopTimer) must be doing the sync.