I have executed the code on two cards: a) Tesla C1060, b) GTX 470
GTX 470 results:
For the TIMER approach I get 0.148 ms
For the EVENT approach I get 0.007 ms
Tesla C1060 results:
For the TIMER approach I get 0.088 ms
For the EVENT approach I get 0.083 ms
Question
My understanding is that CUDA timers on Windows are plain high-performance counters stamped with HOST time, while CUDA events are timestamped on the GPU side. Therefore events should show the actual execution time spent on the GPU. Is that right?
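Roughly, the two measurements look like this (a simplified sketch, not my actual code; the kernel and launch configuration are just placeholders):

#include <cstdio>
#include <windows.h>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for the real kernel */ }

int main()
{
    // --- host timer approach ---
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    cudaThreadSynchronize();              // make sure the GPU is idle first
    QueryPerformanceCounter(&t0);
    myKernel<<<1024, 256>>>();
    cudaThreadSynchronize();              // without this, the host timer measures only launch overhead
    QueryPerformanceCounter(&t1);
    printf("Timer:  %.3f ms\n", 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart);

    // --- CUDA event approach ---
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);            // recorded in the GPU command stream
    myKernel<<<1024, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // wait until the stop event is reached on the GPU

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Events: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}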
What bothers me is the crazy difference between the two approaches on the GTX 470 card. Is such a huge difference possible? If it is, I would kindly ask someone to explain it to me.
On the other hand, the C1060 results are almost identical, which is expected and OK.
I have seen something similar once, and I was quite confused as well. In my case it turned out that the difference between timer-based timing and event-based timing was due to implicit CUDA initialization. The timer-based timing would include the initialization, and the event-based timing system would initialize CUDA before setting the first timing event, effectively excluding the initialization time from the total timing.
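One quick way to rule this out is to force context creation before either timing method starts, for example (a minimal sketch):

// Force CUDA context creation up front so that neither timing approach
// includes the one-time initialization cost. Any runtime API call will do;
// cudaFree(0) is a common idiom for this.
cudaFree(0);
cudaThreadSynchronize();   // make sure initialization has finished

// ... only now start the host timer / record the first event and launch the kernel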
I have added cudaThreadSynchronize() before starting the timer. However, the results remained identical.
This kernel computes all combinations N over K. I have dumped all combinations of 100 over 5 to a file to verify that the kernel works properly, since 0.007 ms for the event measurement seems a bit unreal for 75M combinations.
To verify this I had to use global memory for an output array to store the 75M combinations. This slowed the kernel down a lot, but it proved to me that the kernel is working OK.
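The structure of the verification version is roughly like this (a heavily simplified sketch, not my real code; unrankCombination() stands in for the actual index-to-combination logic):

// Simplified sketch of the verification kernel: each thread unranks its
// global index into one K-element combination and writes it to a global
// output array. unrankCombination() is a placeholder for the real logic.

#define K 5

__device__ void unrankCombination(unsigned long long idx, int n, int k,
                                  unsigned char *combo)
{
    // placeholder for the actual index -> combination mapping
}

__global__ void combinationsKernel(int n, unsigned long long total,
                                   unsigned char *d_out)
{
    unsigned long long tid = blockIdx.x * (unsigned long long)blockDim.x
                           + threadIdx.x;
    if (tid >= total) return;

    unsigned char combo[K];
    unrankCombination(tid, n, K, combo);

    // these global-memory stores are what force the compiler to keep the
    // computation, and what slowed the kernel down compared to the first test
    for (int i = 0; i < K; ++i)
        d_out[tid * K + i] = combo[i];
}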
In this scenario with the output array I got the following results:
GTX 470:
Timers 329 ms
Events 323 ms
C1060:
Timers 777 ms
Events 773 ms
It seems logical now, no more dramatic difference in the timings.
But this raises another question: why is the C1060 now falling behind so badly (2x slower)? Is memory access really that much better on the Fermi architecture?
Hmm, GTX 470 memory bandwidth is about 30% higher than that of the C1060. There are a couple of possible explanations for your case:
- your memory read pattern benefits from Fermi's L1 or L2 cache. You could run your code under the Visual Profiler, whose counters give you L1 and L2 hit rates. Counters for L1: l1_global_load_hit, l1_global_load_miss; for L2, look at read_requests and read_hits (or something like that, I don't remember the exact L2 names off the top of my head).
- somewhat related to the above, if your code spills into local memory on both the C1060 and the GTX 470, then Fermi's L1 cache can help contain local memory spills (again, look at the counters in the profiler: l1_local_load_hit, l1_local_load_miss). Pre-Fermi, all spills contribute to global memory traffic; on Fermi, only spills that miss in L1 contribute to bus traffic.
- there is also a possibility that your code is bound by something other than global memory bandwidth. For example, the GTX 470 has about 70% more instruction throughput.
You mentioned that performance changed once you added writes to gmem. Keep in mind that the compiler throws away any code it detects as not contributing to gmem writes. For more details on how to assess how much time your code spends in memory vs. arithmetic operations, take a look at the Analysis-Driven Optimization presentation from SC10 (slide 15 shows the trick for avoiding writes while still keeping all the code):
http://www.nvidia.com/object/sc10_cuda_tutorial.html
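The trick on that slide boils down to guarding the global-memory store with a flag that is false at run time but unknown at compile time, roughly like this (a sketch, not the exact code from the slides):

// "Keep the math, skip the store": guard the global write with a runtime
// flag. Pass writeResult = 0 from the host; the compiler cannot prove it is
// always 0, so the computation is kept, but almost no memory traffic occurs.
__global__ void benchKernel(const float *in, float *out, int n, int writeResult)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];
    // ... the arithmetic you actually want to time ...
    v = v * v + 1.0f;

    if (writeResult)       // false at run time, unknown at compile time
        out[i] = v;
}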