I’m going through the excellent profiler tutorial that NVIDIA gave at SC10, available at:
and I have a question about computing the instructions/byte ratio. Slide 7 of “Analysis-Driven Optimization” gives it as:
[indent]instructions/byte = 32 * instructions_issued / (128 B * (global_store_transaction + l1_global_load_miss))[/indent]
For a C2050 with ECC off, if this value is less than 3.6 the code is supposed to be bandwidth bound.
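To make sure I’m reading the formula correctly, here is how I’d compute it from profiler counters. The counter names are the ones from the tutorial; the values are made up purely for illustration:

```python
# Slide-7 instructions/byte ratio, using hypothetical counter values
# (counter names per the SC10 tutorial; numbers are invented).
instructions_issued = 1_000_000        # issued instructions, counted per warp
global_store_transaction = 40_000      # store transactions
l1_global_load_miss = 60_000           # global load transactions that miss L1

# Each transaction moves a 128-byte line; each warp instruction covers 32 threads.
bytes_moved = 128 * (global_store_transaction + l1_global_load_miss)
instr_per_byte = 32 * instructions_issued / bytes_moved

# Compare against the C2050 (ECC off) balance point from slide 7.
BALANCE = 3.6
print(f"instructions/byte = {instr_per_byte:.2f}")   # -> 2.50 for these numbers
print("bandwidth bound" if instr_per_byte < BALANCE else "compute bound")
```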
My question: is this really the proper ratio, given that it counts only stores and global loads that miss L1? Is the idea that L1 hits don’t really matter (I’m seeing a 75% hit rate)? And what about l1_local_load_miss?
Also, for the C2050, 1030 GFLOP/s divided by 144 GB/s gives 7.15, not 3.6. What am I missing?
It would also be nice if the profiler computed more of these derived quantities, including the “load efficiency” given on slide 22 (gld_request versus l1_global_load_miss + l1_global_load_hit). There are so many counters that it can be overwhelming. Of course, the quantities of interest depend on the algorithm in question, so I understand some restraint.
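For reference, here is how I understand that slide-22 comparison: one gld_request is issued per warp load, and with perfect coalescing it should map to a single L1 access, so the ratio below would be 1.0. Counter names follow the tutorial; the values are hypothetical:

```python
# Sketch of the slide-22 "load efficiency" idea (my reading of it;
# counter names from the tutorial, values invented for illustration).
gld_request = 50_000          # load requests, one per warp load instruction
l1_global_load_hit = 150_000  # L1 accesses that hit
l1_global_load_miss = 50_000  # L1 accesses that miss

l1_accesses = l1_global_load_hit + l1_global_load_miss
# 1.0 means each warp request generated exactly one L1 access (fully
# coalesced); smaller values indicate replayed / uncoalesced loads.
load_efficiency = gld_request / l1_accesses
print(f"load efficiency = {load_efficiency:.2f}")   # -> 0.25 for these numbers
```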