Since thread_inst_executed is at smsp level, if the dram metric is at device level, I have to calculate NUMBER_OF_SMS * 4 * smsp_thread_inst_executed to see how many instructions were executed on the device with respect to the number of dram read transactions. Is this argument correct?
The prefix before __ (for example smsp__ or dram__) states the unit at which the counter is observed.
The suffix after unit__metric_name (for example .sum or .avg) states how the metric is rolled up over the unit instances.
.sum is the sum of the counter across all units of that type
.avg is the average of the counter across all units of that type
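As a rough illustration of the two rollups, here is a minimal Python sketch. The per-SMSP values and the 2 SM x 4 SMSP device are made up for the example, and the helper name `rollup` is hypothetical; nothing here is read from a real profiler report.

```python
# Hypothetical per-SMSP values of one counter on a made-up device
# with 2 SMs x 4 SMSPs = 8 unit instances.
per_smsp_thread_inst_executed = [1200, 1180, 1210, 1190, 1205, 1195, 1215, 1185]


def rollup(values):
    """Return the .sum and .avg rollups of a per-instance counter."""
    total = sum(values)
    return {"sum": total, "avg": total / len(values)}


rollups = rollup(per_smsp_thread_inst_executed)
num_instances = len(per_smsp_thread_inst_executed)

print("smsp__thread_inst_executed.sum =", rollups["sum"])
print("smsp__thread_inst_executed.avg =", rollups["avg"])
# Multiplying the average by the number of instances only reproduces the sum.
print("avg * instances =", rollups["avg"] * num_instances)
```

In this sketch, multiplying .avg by the number of SMSPs gives back .sum, so the .sum rollup needs no further scaling by the SM count.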
Since thread_inst_executed is at smsp level, if the dram metric is at device level, I have to calculate NUMBER_OF_SMS * 4 * smsp_thread_inst_executed to see how many instructions were executed on the device with respect to the number of dram read transactions.
No. smsp__thread_inst_executed.sum is already rolled up to the device level, so it does not need to be multiplied by NUMBER_OF_SMS * 4.
dram__sectors_read.sum / smsp__thread_inst_executed.sum ==> average number of DRAM sectors read per thread instruction executed. Invert the ratio to get the average number of thread instructions executed per DRAM sector read.
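The two device-level rollups can be divided in either direction. A small sketch of both ratios, assuming made-up values in place of numbers from a real report (the function name `safe_ratio` is hypothetical):

```python
# Hypothetical device-level rollup values; not taken from a real profile.
dram_sectors_read_sum = 50_000
smsp_thread_inst_executed_sum = 4_000_000


def safe_ratio(numerator, denominator):
    """Divide two counters, returning None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


# DRAM sectors read per thread instruction executed.
sectors_per_inst = safe_ratio(dram_sectors_read_sum, smsp_thread_inst_executed_sum)
# Thread instructions executed per DRAM sector read.
inst_per_sector = safe_ratio(smsp_thread_inst_executed_sum, dram_sectors_read_sum)

print("sectors per thread instruction:", sectors_per_inst)
print("thread instructions per sector:", inst_per_sector)
```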
The SM does not know if a memory request reached DRAM. The DRAM controller does not know if a request was from the SM. dram__sectors_read.sum therefore includes all DRAM reads from all clients, including but not limited to SMs (loads, partial stores, icache misses, constant misses), display controller, asynchronous copy engine, CPU via PCIe, NVLink, etc.