Events vs Timers - big differences measuring kernel execution time

Hello,

I’m quite new to CUDA and I need help with the correct way to measure kernel execution time.

I have used CUDA timers and CUDA events for measuring time and got quite different results.

<b>TIMER CODE:</b>

cutStartTimer(hTimer);

KERNEL <<< >>>

cudaThreadSynchronize();

cutStopTimer(hTimer);

elapsedtime = cutGetTimerValue(hTimer);

printf( "Processing time: %.3f ms\n", elapsedtime);

<b>EVENT CODE:</b>

cudaEventRecord(start, 0);

KERNEL <<< >>>

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&elapsedtime, start, stop);

printf( "Processing time: %.3f ms\n", elapsedtime);

I have executed the code on two cards: a) Tesla C1060, b) GTX 470

GTX 470 results

For TIMER approach I get 0.148 ms

For EVENT approach I get 0.007 ms

TESLA C1060 results

For TIMER approach I get 0.088 ms

For EVENT approach I get 0.083 ms

Question

As I understand it, the CUDA timers on Windows are plain high-performance timers stamped with HOST time, while CUDA events are timestamped on the GPU side. Therefore events show the actual execution time spent on the GPU. Is that right?

What bothers me is the crazy difference between the two approaches on the GTX 470. Is such a huge difference possible? If it is, then I would kindly ask someone to explain it to me.

On the other hand, the C1060 results are almost identical, which is expected and OK.

Can you please clarify this to me ?

best regards

Mirko

I have seen something similar once, and I was quite confused as well. In my case it turned out that the difference between timer-based timing and event-based timing was due to implicit CUDA initialization. The timer-based timing would include the initialization, and the event-based timing system would initialize CUDA before setting the first timing event, effectively excluding the initialization time from the total timing.

many thanks for your reply.

If there is some CUDA guru to explain this - I would be very grateful.

many thanks

Mirko

Are you calling cudaThreadSynchronize before “starting” the CPU timer? If not, you could be timing previously issued asynchronous calls as well.

Hello Paulius,

I have added cudaThreadSynchronize() before starting the timer. However, the results remained identical.

This kernel computes all combinations N over K. I dumped all the combinations 100 over 5 to a file to verify that the kernel is working properly, since 0.007 ms from the event measurement seems a bit unreal for 75M combinations.

To verify this I had to use global memory for an output array to store the 75M combinations. This slowed the kernel down a lot, but it proved to me that the kernel is working OK.

In this scenario with the output array I got the following results:
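To put a number on "a bit unreal", here is a minimal host-side sketch in plain C++ (no GPU needed). The helper names `binomial` and `items_per_second` are made up for this illustration and are not part of CUDA or cutil. It counts the combinations for 100 over 5 and converts an elapsed time in milliseconds into an implied rate.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of ways to choose k items out of n.
// The running product is an exact integer after every division,
// because result * (n - k + i) / i equals C(n - k + i, i) at step i.
std::uint64_t binomial(std::uint64_t n, std::uint64_t k) {
    assert(k <= n);
    if (k > n - k) k = n - k;  // use the smaller of k and n - k
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= k; ++i) {
        result = result * (n - k + i) / i;
    }
    return result;
}

// Hypothetical helper: rate implied by processing `items` work items
// in `milliseconds` of elapsed time.
double items_per_second(std::uint64_t items, double milliseconds) {
    return static_cast<double>(items) / (milliseconds * 1e-3);
}
```

Calling `binomial(100, 5)` gives 75,287,520, which matches the "75M" above. Feeding that count and 0.007 ms into `items_per_second` implies on the order of 10^13 combinations per second, while the 323 ms measurement implies roughly 2.3 × 10^8 per second. A rate like the first one is a strong hint that the kernel did not actually do the work during the timed interval.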

GTX 470:

Timers 329 ms
Events 323 ms

C1060

Timers 777 ms
Events 773 ms

It seems logical now - no more dramatic difference in timings.

But this raised another question: why is the C1060 now doing so badly, more than 2x slower? Is memory access that much better on the Fermi architecture?

Regards
Mirko

Hmm, GTX 470 memory bandwidth is about 30% higher than that of C1060. There are a few possible explanations for your case:

  • your memory read pattern benefits from Fermi’s L1 or L2 cache. You could run your code from the Visual Profiler, from whose counters you can get L1 and L2 hit rates. Counters for L1: l1_global_load_hit, l1_global_load_miss, for L2 look at read_requests and read_hits (or something like that, I don’t remember the exact L2 names off the top of my head).
  • somewhat related to the above, if your code spills into local memory on both C1060 and GTX 470, then Fermi’s L1 cache can help contain local memory spills (again, look at the counters in the profiler: l1_local_load_hit, l1_local_load_miss). Pre-Fermi, all spills contribute to global memory traffic; on Fermi only spills that miss in L1 contribute to bus traffic.
  • there’s also a possibility that your code is bound by something other than global memory bandwidth. For example, GTX 470 has about 70% more instruction throughput.
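
As a rough sanity check on the numbers in this thread, the observed speedup can be compared with the two hardware ratios mentioned above. This is only a back-of-the-envelope sketch in plain C++. The helper names `speedup` and `explains_alone` are invented for the illustration, and treating each hardware advantage as a simple upper bound on speedup is a simplifying assumption.

```cpp
#include <cassert>

// Hypothetical helper: speedup of the faster run over the slower one,
// given two elapsed times in the same unit.
double speedup(double slower_time, double faster_time) {
    assert(faster_time > 0.0);
    return slower_time / faster_time;
}

// Hypothetical helper: true if a single hardware advantage ratio
// (e.g. 1.3 for "about 30% higher") is at least as large as the
// observed speedup, i.e. could account for it on its own.
bool explains_alone(double hardware_ratio, double observed_speedup) {
    return hardware_ratio >= observed_speedup;
}
```

With the event timings reported earlier (773 ms on C1060, 323 ms on GTX 470), `speedup(773.0, 323.0)` is about 2.39. Neither `explains_alone(1.3, ...)` for bandwidth nor `explains_alone(1.7, ...)` for instruction throughput returns true, which is consistent with the cache and spill effects in the list above playing a part.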

You mentioned that performance changed once you added writes to gmem. Keep in mind that the compiler throws away any code it detects as not contributing to gmem writes, which would explain why the kernel without the output array appeared to finish almost instantly. For more details on how to assess how much time your code spends in memory vs arithmetic operations, take a look at the Analysis-driven Optimization presentation from SC10 (slide 15 shows the trick for avoiding writes and still keeping all the code):
http://www.nvidia.com/object/sc10_cuda_tutorial.html

If you track down what it is, please post here.

many thanks paulius

I will investigate it and post my findings. I will post the complete code here - it is at the proof-of-concept stage at the moment.

thanks again,

Mirko