Getting different time for kernel execution.

CUDAkk · July 30, 2009, 12:09pm

Hi Everyone,

I want calculate each individual kernel execution time. So, I am calculating like follwing code sample :

unsigned int	funTime  = 0;

	 float			  time	   =0.0f;

	 CUT_SAFE_CALL( cutCreateTimer(&funTime) );

	 CUT_SAFE_CALL( cutStartTimer(funTime) );

	 

	  MyKernel <<<12, 256>>>( float inputArray,  float outputArray );

				  cudaThreadSynchronize();

	  CUT_SAFE_CALL( cutStopTimer(funTime) );

	  time = cutGetTimerValue(funTime);

	 CUT_SAFE_CALL( cutDeleteTimer(funTime) );

	 printf("\MyKernel()  execution time: %2f (ms) \n", time );

I am getting

But CUDA profiler 2.3 gives the execution time

Why am I getting this difference in execution time ?

jph4599 · July 30, 2009, 12:22pm

You should not be using CUTIL, it is meant only for SDK developers to create quick and clean examples.

Given that, I’m guessing that the profiler uses actual hardware timers (cuda events) and the cutil timers use lower resolution software timers. Also, I’m going to assume cutStopTimer does a cudaThreadSynchronize, otherwords you would just be timing the kernel launch time as they are asynchronous.

CUDAkk · July 30, 2009, 12:39pm

Could you tell best way to calculate time of kernel from the code?

Keither · July 30, 2009, 12:58pm

if am not wrong cutilStopTimer, does not call any threadsynch code.

////////////////////////////////////////////////////////////////////////////////

//! Stop the time with name \a name. Does not reset.

//! @param name  name of the timer to stop

////////////////////////////////////////////////////////////////////////////////

CUTBoolean CUTIL_API

cutStopTimer( const unsigned int name) 

{

	CUTBoolean retval = CUTTrue;

#ifdef _DEBUG

	try 

	{

		StopWatch::get( name).stop();

	}

	catch( const std::exception& ex) 

	{

		std::cerr << "WARNING: " << ex.what() << std::endl;

		retval = CUTFalse;

	}

#else

	StopWatch::get( name).stop();

#endif

	return retval;

}

the answer is i dont know. i know it does not help a lot but it’ll save you time asking me this question.

please don’t kick me toot hard for this >.<

CUDAkk · July 30, 2009, 1:08pm

Yes, you are absolutely corrct here. I have checked it .

jph4599 · July 30, 2009, 1:49pm

Check out section “3.2.6.2 Event” in the programming guide:

The following code sample creates two events:

cudaEvent_t start, stop;

cudaEventCreate(&start);

cudaEventCreate(&stop);

These events can be used to time the code sample of the previous section the following way:

cudaEventRecord(start, 0);

for (int i = 0; i < 2; ++i)

cudaMemcpyAsync(inputDev + i * size, inputHost + i * size, size, cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)

myKernel<<<100, 512, 0, stream[i]>>>(outputDev + i * size, inputDev + i * size, size);

for (int i = 0; i < 2; ++i)

cudaMemcpyAsync(outputHost + i * size, outputDev + i * size, size, cudaMemcpyDeviceToHost, stream[i]);

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

float elapsedTime;

cudaEventElapsedTime(&elapsedTime, start, stop);

They are destroyed this way:

cudaEventDestroy(start);

cudaEventDestroy(stop);

It looks like the nbody SDK example project uses these event timers:

void runBenchmark(int iterations)

{

	// once without timing to prime the GPU

	nbody->update(activeParams.m_timestep);

	cutilSafeCall(cudaEventRecord(startEvent, 0));

	for (int i = 0; i < iterations; ++i)

	{

		nbody->update(activeParams.m_timestep);

	}

	cutilSafeCall(cudaEventRecord(stopEvent, 0));  

	cudaEventSynchronize(stopEvent);

	float milliseconds = 0;

	cutilSafeCall( cudaEventElapsedTime(&milliseconds, startEvent, stopEvent));

	double interactionsPerSecond = 0;

	double gflops = 0;

	computePerfStats(interactionsPerSecond, gflops, milliseconds, iterations);

	

	printf("%d bodies, total time for %d iterations: %0.3f ms\n", 

		   numBodies, iterations, milliseconds);

	printf("= %0.3f billion interactions per second\n", interactionsPerSecond);

	printf("= %0.3f GFLOP/s at %d flops per interaction\n", gflops, 20);   

}

Keep in mind however, these are just device timers and will only time what’s going on with the hardware…perfect if you are just trying to time a kernel.

jph4599 · July 30, 2009, 1:52pm

You are correct, sorry about that.

I missed the indented “cudaThreadSynchronize();” after the call in the original post. I was assuming that since it took 15ms as opposed to the very short amount of time required to launch the kernel, something else (cutilStopTimer) must be doing the sych.

Topic		Replies	Views
SPMT: Single Program Multiple (Exeuction) Time CUDA Programming and Performance	15	3901	July 4, 2009
Kernel execution overhead CUDA Programming and Performance	2	1159	July 6, 2009
Timer&Event CUDA Programming and Performance	3	3587	December 1, 2009
Timing the Kernel CUDA Programming and Performance	3	3727	January 15, 2010
timing performance of kernels how ? cudaprof vs cudaEventRecord vs cutStartTimer CUDA Programming and Performance	3	5300	March 21, 2009
clock() doesn't work properly CUDA Programming and Performance	10	6295	July 3, 2009
Oscilating performance, Code total times variates CUDA Programming and Performance	10	10572	June 21, 2009
time measurement discrepancy timer, clock(), profiling CUDA Programming and Performance	4	6696	April 7, 2010
How properly counting a performance/program time ? CUDA Programming and Performance	4	2568	August 28, 2007
Execution time is different in Profiller and Console. why? CUDA Programming and Performance	4	3742	August 3, 2009

Getting different time for kernel execution.

Related topics