I have executed the code on two cards: a) Tesla C1060, b) GTX 470
GTX 470 results:
For the TIMER approach I get 0.148 ms
For the EVENT approach I get 0.007 ms
Tesla C1060 results:
For the TIMER approach I get 0.088 ms
For the EVENT approach I get 0.083 ms
Question
My understanding is that CUDA timers on Windows are plain high-performance counters stamped with HOST time, while CUDA events are timestamped on the GPU side. Therefore events should show the actual execution time spent on the GPU. Is that right?
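Roughly, the two measurements look like this (a simplified sketch, not my actual code; the kernel and launch configuration are just placeholders):

#include <cstdio>
#include <windows.h>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for the real kernel */ }

int main()
{
    // --- host timer approach ---
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    cudaThreadSynchronize();              // make sure the GPU is idle first
    QueryPerformanceCounter(&t0);
    myKernel<<<1024, 256>>>();
    cudaThreadSynchronize();              // without this, the host timer measures only launch overhead
    QueryPerformanceCounter(&t1);
    printf("Timer:  %.3f ms\n", 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart);

    // --- CUDA event approach ---
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);            // recorded in the GPU command stream
    myKernel<<<1024, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // wait until the stop event is reached on the GPU

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Events: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}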
What bothers me is the crazy difference between the two approaches on the GTX 470 card. Is such a huge difference possible? If it is, I would kindly ask someone to explain it to me.
On the other hand, the C1060 results are almost identical, which is expected and OK.
I have seen something similar once, and I was quite confused as well. In my case it turned out that the difference between timer-based timing and event-based timing was due to implicit CUDA initialization. The timer-based timing would include the initialization, and the event-based timing system would initialize CUDA before setting the first timing event, effectively excluding the initialization time from the total timing.
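One quick way to rule this out is to force context creation before either timing method starts, for example (a minimal sketch):

// Force CUDA context creation up front so that neither timing approach
// includes the one-time initialization cost. Any runtime API call will do;
// cudaFree(0) is a common idiom for this.
cudaFree(0);
cudaThreadSynchronize();   // make sure initialization has finished

// ... only now start the host timer / record the first event and launch the kernel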
I have added cudaThreadSynchronize() before starting the timer. However, the results remained identical.
This kernel computes all combinations N over K. I have dumped all combinations of 100 over 5 to a file to verify that the kernel works properly, since 0.007 ms for the event measurement seems a bit unreal for 75M combinations.
To verify this I had to use global memory for an output array to store the 75M combinations. This slowed the kernel down a lot, but it proved to me that the kernel is working OK.
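The structure of the verification version is roughly like this (a heavily simplified sketch, not my real code; unrankCombination() stands in for the actual index-to-combination logic):

// Simplified sketch of the verification kernel: each thread unranks its
// global index into one K-element combination and writes it to a global
// output array. unrankCombination() is a placeholder for the real logic.

#define K 5

__device__ void unrankCombination(unsigned long long idx, int n, int k,
                                  unsigned char *combo)
{
    // placeholder for the actual index -> combination mapping
}

__global__ void combinationsKernel(int n, unsigned long long total,
                                   unsigned char *d_out)
{
    unsigned long long tid = blockIdx.x * (unsigned long long)blockDim.x
                           + threadIdx.x;
    if (tid >= total) return;

    unsigned char combo[K];
    unrankCombination(tid, n, K, combo);

    // these global-memory stores are what force the compiler to keep the
    // computation, and what slowed the kernel down compared to the first test
    for (int i = 0; i < K; ++i)
        d_out[tid * K + i] = combo[i];
}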
In this scenario with the output array I got the following results:
GTX 470:
Timers 329 ms
Events 323 ms
C1060:
Timers 777 ms
Events 773 ms
It seems logical now, no more dramatic difference in the timings.
But this raises another question: why is the C1060 now falling behind so badly (2x slower)? Is memory access really that much better on the Fermi architecture?
Hmm, GTX 470 memory bandwidth is about 30% higher than that of the C1060. There are a couple of possible explanations for your case:
- your memory read pattern benefits from Fermi's L1 or L2 cache. You could run your code under the Visual Profiler, whose counters give you L1 and L2 hit rates. Counters for L1: l1_global_load_hit, l1_global_load_miss; for L2, look at read_requests and read_hits (or something like that, I don't remember the exact L2 names off the top of my head).
- somewhat related to the above, if your code spills into local memory on both the C1060 and the GTX 470, then Fermi's L1 cache can help contain local memory spills (again, look at the counters in the profiler: l1_local_load_hit, l1_local_load_miss). Pre-Fermi, all spills contribute to global memory traffic; on Fermi, only spills that miss in L1 contribute to bus traffic.
- there is also a possibility that your code is bound by something other than global memory bandwidth. For example, the GTX 470 has about 70% more instruction throughput.
You mentioned that performance changed once you added writes to gmem. Keep in mind that the compiler throws away any code it detects as not contributing to gmem writes. For more details on how to assess how much time your code spends in memory vs. arithmetic operations, take a look at the Analysis-Driven Optimization presentation from SC10 (slide 15 shows the trick for avoiding writes while still keeping all the code):
http://www.nvidia.com/object/sc10_cuda_tutorial.html
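The trick on that slide boils down to guarding the global-memory store with a flag that is false at run time but unknown at compile time, roughly like this (a sketch, not the exact code from the slides):

// "Keep the math, skip the store": guard the global write with a runtime
// flag. Pass writeResult = 0 from the host; the compiler cannot prove it is
// always 0, so the computation is kept, but almost no memory traffic occurs.
__global__ void benchKernel(const float *in, float *out, int n, int writeResult)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];
    // ... the arithmetic you actually want to time ...
    v = v * v + 1.0f;

    if (writeResult)       // false at run time, unknown at compile time
        out[i] = v;
}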