To report the performance of a kernel I've written, I'd like to calculate its instruction-to-byte ratio.
This quantity is defined as “how many arithmetic operations your kernel performs per byte it reads” [1], and the formula is:
RATIO = (32 * instructions_issued) / (128 * global_store_transaction + 128 * L1_global_load_miss)
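To make the arithmetic concrete, here is a minimal sketch of the formula above. It assumes that the factor 32 is the number of threads per warp (so that warp-level issued instructions become per-thread instructions) and that 128 is the number of bytes per store transaction or L1 line. The function name and the counter values are hypothetical, chosen only for illustration.

```python
WARP_SIZE = 32    # assumption: threads per warp
LINE_BYTES = 128  # assumption: bytes per store transaction / L1 line


def instruction_to_byte_ratio(instructions_issued,
                              global_store_transaction,
                              l1_global_load_miss):
    """Evaluate RATIO exactly as written in the formula above."""
    bytes_moved = LINE_BYTES * (global_store_transaction + l1_global_load_miss)
    if bytes_moved == 0:
        raise ValueError("no global memory traffic recorded")
    return WARP_SIZE * instructions_issued / bytes_moved


# Hypothetical profiler counter values, not measurements.
ratio = instruction_to_byte_ratio(
    instructions_issued=1_000_000,
    global_store_transaction=50_000,
    l1_global_load_miss=200_000,
)
print(ratio)  # 32,000,000 instructions / 32,000,000 bytes -> 1.0
```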
I understand that this quantity indicates whether the kernel is compute bound or memory bound. That said, I don't understand why the formula uses global store transactions and L1 global load misses.
global_store_transaction is the number of lines stored to global memory.
L1_global_load_miss is the number of lines that were requested from the L1 cache but missed.
Could someone explain why global_store_transaction and L1_global_load_miss are used? Could we not instead use the DRAM throughput (dram_read_throughput + dram_write_throughput) multiplied by the kernel runtime?
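The alternative I have in mind could be sketched as follows. This assumes the profiler reports both throughputs in GB/s with 1 GB = 1e9 bytes and that the runtime is given in seconds. The function names and input values are hypothetical. The resulting byte count need not match the one in the formula, since the two are measured at different levels of the memory hierarchy.

```python
WARP_SIZE = 32  # assumption: threads per warp, as in the original formula


def dram_bytes(dram_read_throughput_gbs, dram_write_throughput_gbs, runtime_s):
    """Estimate bytes moved to/from DRAM as throughput times runtime."""
    total_gbs = dram_read_throughput_gbs + dram_write_throughput_gbs
    return total_gbs * 1e9 * runtime_s  # assumption: 1 GB = 1e9 bytes


def ratio_from_dram(instructions_issued, dram_read_throughput_gbs,
                    dram_write_throughput_gbs, runtime_s):
    """Instruction-to-byte ratio using the DRAM-based byte estimate."""
    moved = dram_bytes(dram_read_throughput_gbs,
                       dram_write_throughput_gbs, runtime_s)
    if moved == 0:
        raise ValueError("no DRAM traffic recorded")
    return WARP_SIZE * instructions_issued / moved


# Hypothetical values: 80 GB/s read, 20 GB/s write, 1 ms kernel runtime.
alt_ratio = ratio_from_dram(1_000_000, 80.0, 20.0, 1e-3)
print(alt_ratio)  # roughly 0.32 for these made-up inputs
```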
[1] http://babrodtk.at.ifi.uio.no/files/publications/brodtkorb_etal_meta10.pdf