I’m going through the excellent profiler tutorial that NVIDIA gave at SC10, available at:
and I have a question about computing the instructions/byte ratio. Slide 7 of “Analysis-Driven Optimization” gives it as:
[indent]instructions/byte = 32 * instructions_issued / (128 B * (global_store_transaction + l1_global_load_miss))[/indent]
For a C2050 with ECC off, if this value is less than 3.6 the code is supposed to be bandwidth bound.
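To make sure I’m reading the formula correctly, here is how I’d compute it from profiler counters. The counter names are the ones from the tutorial; the values are made up purely for illustration:

```python
# Slide-7 instructions/byte ratio, using hypothetical counter values
# (counter names per the SC10 tutorial; numbers are invented).
instructions_issued = 1_000_000        # issued instructions, counted per warp
global_store_transaction = 40_000      # store transactions
l1_global_load_miss = 60_000           # global load transactions that miss L1

# Each transaction moves a 128-byte line; each warp instruction covers 32 threads.
bytes_moved = 128 * (global_store_transaction + l1_global_load_miss)
instr_per_byte = 32 * instructions_issued / bytes_moved

# Compare against the C2050 (ECC off) balance point from slide 7.
BALANCE = 3.6
print(f"instructions/byte = {instr_per_byte:.2f}")   # -> 2.50 for these numbers
print("bandwidth bound" if instr_per_byte < BALANCE else "compute bound")
```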
My question: is this really the proper ratio, given that it counts only stores and global loads that miss L1? Is the idea that L1 hits don’t really matter (I’m seeing a 75% hit rate)? And what about l1_local_load_miss?
Also, for the C2050, 1030 GFLOP/s divided by 144 GB/s gives 7.15, not 3.6. What am I missing?
It would also be nice if the profiler computed more of these derived quantities, including the “load efficiency” given on slide 22 (gld_request versus l1_global_load_miss + l1_global_load_hit). There are so many counters that it can be overwhelming. Of course, the quantities of interest depend on the algorithm in question, so I understand some restraint.
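For reference, here is how I understand that slide-22 comparison: one gld_request is issued per warp load, and with perfect coalescing it should map to a single L1 access, so the ratio below would be 1.0. Counter names follow the tutorial; the values are hypothetical:

```python
# Sketch of the slide-22 "load efficiency" idea (my reading of it;
# counter names from the tutorial, values invented for illustration).
gld_request = 50_000          # load requests, one per warp load instruction
l1_global_load_hit = 150_000  # L1 accesses that hit
l1_global_load_miss = 50_000  # L1 accesses that miss

l1_accesses = l1_global_load_hit + l1_global_load_miss
# 1.0 means each warp request generated exactly one L1 access (fully
# coalesced); smaller values indicate replayed / uncoalesced loads.
load_efficiency = gld_request / l1_accesses
print(f"load efficiency = {load_efficiency:.2f}")   # -> 0.25 for these numbers
```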