I’m going through the excellent profiler tutorial that NVIDIA gave at SC10, available at:
and I have a question about computing the instructions/byte ratio. Slide 7 of “Analysis-Driven Optimization” gives it as:
[indent]instructions/byte = 32 * instructions_issued / (128 B * (global_store_transaction + l1_global_load_miss))[/indent]
For a C2050 with ECC off, if this value is less than 3.6 the code is supposed to be bandwidth bound.
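To make sure I’m reading the formula correctly, here is how I’d compute it from profiler counters. The counter names are the ones from the tutorial; the values are made up purely for illustration:

```python
# Slide-7 instructions/byte ratio, using hypothetical counter values
# (counter names per the SC10 tutorial; numbers are invented).
instructions_issued = 1_000_000        # issued instructions, counted per warp
global_store_transaction = 40_000      # store transactions
l1_global_load_miss = 60_000           # global load transactions that miss L1

# Each transaction moves a 128-byte line; each warp instruction covers 32 threads.
bytes_moved = 128 * (global_store_transaction + l1_global_load_miss)
instr_per_byte = 32 * instructions_issued / bytes_moved

# Compare against the C2050 (ECC off) balance point from slide 7.
BALANCE = 3.6
print(f"instructions/byte = {instr_per_byte:.2f}")   # -> 2.50 for these numbers
print("bandwidth bound" if instr_per_byte < BALANCE else "compute bound")
```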
My question: is this really the proper ratio, given that it counts only stores and global loads that miss L1? Is the idea that L1 hits don’t really matter (I’m seeing a 75% hit rate)? And what about l1_local_load_miss?
Also, for the C2050, 1030 GFLOP/s divided by 144 GB/s gives 7.15, not 3.6. What am I missing?
It would also be nice if the profiler computed more of these derived quantities, including the “load efficiency” given on slide 22 (gld_request versus l1_global_load_miss + l1_global_load_hit). There are so many counters that it can be overwhelming. Of course, the quantities of interest depend on the algorithm in question, so I understand some restraint.
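For reference, here is how I understand that slide-22 comparison: one gld_request is issued per warp load, and with perfect coalescing it should map to a single L1 access, so the ratio below would be 1.0. Counter names follow the tutorial; the values are hypothetical:

```python
# Sketch of the slide-22 "load efficiency" idea (my reading of it;
# counter names from the tutorial, values invented for illustration).
gld_request = 50_000          # load requests, one per warp load instruction
l1_global_load_hit = 150_000  # L1 accesses that hit
l1_global_load_miss = 50_000  # L1 accesses that miss

l1_accesses = l1_global_load_hit + l1_global_load_miss
# 1.0 means each warp request generated exactly one L1 access (fully
# coalesced); smaller values indicate replayed / uncoalesced loads.
load_efficiency = gld_request / l1_accesses
print(f"load efficiency = {load_efficiency:.2f}")   # -> 0.25 for these numbers
```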