Events vs Timers - big difference when measuring kernel execution time

Hello,

I’m quite new to CUDA and need help with the correct way of measuring kernel execution time.

I have measured time with both CUDA (cutil) timers and CUDA events and got quite different results.

TIMER CODE:

unsigned int hTimer;
float elapsedtime;
cutCreateTimer(&hTimer);                  // cutil host-side timer
cutResetTimer(hTimer);
cutStartTimer(hTimer);
KERNEL<<<grid, block>>>(/* args */);      // grid/block/args stand for my real launch
cutStopTimer(hTimer);
elapsedtime = cutGetTimerValue(hTimer);   // milliseconds
printf("Processing time: %.3f ms\n", elapsedtime);
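Since the kernel launch is asynchronous, I wonder whether the host timer should only be stopped after an explicit synchronize. A sketch of the variant I have in mind (grid/block/args are placeholders again):

cutStartTimer(hTimer);
KERNEL<<<grid, block>>>(/* args */);
cudaThreadSynchronize();                  // block the host until the kernel has finished
cutStopTimer(hTimer);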
EVENT CODE:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);                // timestamp written by the GPU in stream 0
KERNEL<<<grid, block>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait until the stop event has been reached
cudaEventElapsedTime(&elapsedtime, start, stop);
printf("Processing time: %.3f ms\n", elapsedtime);
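In case a silently failing call could explain a strange number, error checking could be added along these lines (a minimal sketch; err is just a local name):

cudaError_t err = cudaEventElapsedTime(&elapsedtime, start, stop);
if (err != cudaSuccess)
    printf("cudaEventElapsedTime failed: %s\n", cudaGetErrorString(err));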

I have executed the code on two cards: a) Tesla C1060, b) GTX 470.

GTX 470 results

For the TIMER approach I get 0.148 ms.

For the EVENT approach I get 0.007 ms ?!

Tesla C1060 results

For the TIMER approach I get 0.088 ms.

For the EVENT approach I get 0.083 ms.

Question

My understanding is that the CUDA timers on Windows are plain high-performance counters stamped with HOST time, while CUDA events are timestamped on the GPU side. Therefore events show the actual execution time spent on the GPU. Is that right?
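To make my mental model concrete, I picture the cutil timer reducing to something like this host-side pattern (this is only how I imagine it works; freq/t0/t1 are my own names):

#include <windows.h>   // QueryPerformanceCounter / LARGE_INTEGER

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);             // host timestamp, taken immediately
KERNEL<<<grid, block>>>(/* args */);      // launch returns to the host asynchronously
QueryPerformanceCounter(&t1);             // host timestamp; the kernel may still be running
double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;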

What bothers me is the huge difference between the two approaches on the GTX 470 card. Is such a big difference possible? If it is, I would kindly ask someone to explain it to me.

On the other hand, the C1060 results are almost identical, which is what I would expect.

Can you please clarify this for me?

Best regards,

Mirko