Instructions/byte profiler calculation

I’m going over the excellent profiler tutorial NVIDIA gave at SC10, available at:

[indent]http://www.nvidia.com/object/sc10_cuda_tutorial.html[/indent]

and have a question about computing the instructions/byte. Slide 7 of “Analysis Driven Optimization” gives it as:

[indent]op/Byte = 32 * instructions_issued / (128B * (global_store_transaction + L1_global_load_miss))[/indent]

For C2050 with ECC off, if this value is less than 3.6 the code is supposed to be bandwidth bound.
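
Just so I’m sure I’m reading the formula right, here is how I would compute it from the raw counters (the counter values below are made up purely for illustration):

[code]
// My reading of the slide-7 formula; counter values are made up for illustration
double instructions_issued      = 2.0e7;  // profiler counter
double global_store_transaction = 1.0e6;  // profiler counter
double L1_global_load_miss      = 1.5e6;  // profiler counter

double instr_per_byte = 32.0 * instructions_issued /
                        (128.0 * (global_store_transaction + L1_global_load_miss));

// C2050, ECC off: if instr_per_byte < 3.6, presumably bandwidth bound
[/code]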

My question: Is this really the proper ratio, given that it only counts stores and global loads that miss the L1? Is the idea that L1 hits don’t really matter (I’m getting a 75% hit rate)? How about L1_local_load_miss?

Also, for the C2050, 1030 GFLOP/s / 144 GB/s gives 7.15, not 3.6. What am I missing?

It would be nice if the profiler could compute more of these derived quantities, including “load efficiency” given on slide 22 (gld_request versus L1_global_load_miss + L1_global_load_hit). There are so many counters that it can be overwhelming. Of course the quantities of interest do depend on the algorithm in question, so I understand some restraint.
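
For example, my reading of the slide-22 “load efficiency” metric (again with made-up counter values, and I’m not sure of the exact normalization the slide uses):

[code]
// Slide 22 "load efficiency" as I understand it; values are made up
double gld_request         = 1.0e6;   // load requests issued by the SM
double L1_global_load_miss = 4.0e5;   // global load transactions that missed L1
double L1_global_load_hit  = 1.2e6;   // global load transactions that hit L1

// Roughly: how many transactions each load request turned into
double transactions_per_request =
    (L1_global_load_miss + L1_global_load_hit) / gld_request;
[/code]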

Thanks,
Matt

The 1030 GFLOP/s number counts operations, not instructions. A multiply-add is counted as two operations (which it is), but it’s a single instruction. The 3.6 ratio is purely the ratio of peak fp32 instruction throughput to memory bandwidth.
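
Spelled out:

[indent]peak fp32 instruction rate = 1030 GFLOP/s / 2 ops per FMA instruction = 515 Ginstr/s
515 Ginstr/s / 144 GB/s ~ 3.6 instructions/byte[/indent]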

You could count L1 hits, but keep in mind that L1 hit bandwidth is 1030 GB/s. Compared to the 144 GB/s gmem bandwidth it’s fairly irrelevant - at that point an L1 hit is basically comparable to any other instruction.

Keep in mind that looking at the instruction:byte ratio is just an approximation. For one, to make it more accurate you would need to know your instruction mix and account for the varying throughputs (for example, fp64 throughput is half of fp32 on C2050, so the corresponding ratio in that case would be 1.8). The most accurate way to determine the limiter is to do the analysis with modified source code.
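
As a minimal sketch of what I mean by modified source (my own illustration, not the exact tutorial code): time a memory-only and a math-only variant of the kernel and see which one the full kernel’s time tracks. The flag argument is always zero at run time but keeps the compiler from eliminating the arithmetic.

[code]
// Full kernel: load, compute, store
__global__ void full_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        v = v * 2.0f + 1.0f;       // "math" part (stand-in for the real work)
        out[i] = v;                // "memory" part
    }
}

// Memory-only variant: keep the global load/store, drop the math
__global__ void mem_only(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Math-only variant: keep the arithmetic, drop the global accesses.
// 'flag' is 0 at run time, so the store never executes, but the compiler
// cannot prove that and therefore keeps the math.
__global__ void math_only(float *out, int n, int flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = (float)i;
        v = v * 2.0f + 1.0f;
        if (flag) out[i] = v;
    }
}
[/code]

If the full kernel’s time is close to the memory-only time it’s bandwidth bound; if it’s close to the math-only time it’s instruction bound. Keep in mind that removing code can change register count and occupancy, so the comparison is only approximate.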

Ah, yes. Of course.

Great class by the way. Are there plans to integrate more of what you taught into the Visual Profiler and its documentation?

Thank you,

Matt

Yes, a number of the methods are being added to the Visual Profiler.

Another question about the presentation: on slide 17, the app throughput is 62 GB/s (out of 114 GB/s). How did you get this number? Is the number of memory access transactions reported per SM, or is it the total number of transactions executed on the GPU per kernel run?

I’ve tried to calculate it from the numbers on the slide, but somehow it doesn’t add up.

memory access transactions = 1,708,032

transaction size (Fermi) = 128 B
mem-only kernel time = 33.27 ms

app throughput = (# memory access transactions * transaction size) / time = (1,708,032 * 128 B) / 33.27 ms ~ 6.2 GB/s (<< 62 GB/s)

If we multiply by 14 (for the 14 SMs on a Tesla C2050), we get ~87 GB/s (> 62 GB/s)
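
In code, the two interpretations I tried look like this (just restating the arithmetic above; which one the profiler actually uses is exactly my question):

[code]
#include <stdio.h>

int main(void)
{
    /* Numbers quoted from slide 17 */
    double transactions = 1708032.0;   /* memory access transactions counter */
    double tx_size      = 128.0;       /* Fermi transaction size in bytes    */
    double time_s       = 33.27e-3;    /* mem-only kernel time               */
    double num_sms      = 14.0;        /* Tesla C2050                        */

    /* Interpretation 1: the counter is the whole-GPU total */
    double bw_total  = transactions * tx_size / time_s;
    /* Interpretation 2: the counter is per SM, so scale by the SM count */
    double bw_per_sm = transactions * tx_size * num_sms / time_s;

    printf("whole-GPU: %.1f GB/s\n", bw_total  / 1e9);   /* well below 62 GB/s */
    printf("x14 SMs:   %.1f GB/s\n", bw_per_sm / 1e9);   /* well above 62 GB/s */
    return 0;
}
[/code]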