DRAM metrics at SM or device level?

I would like to know whether the DRAM transaction metrics, e.g. dram__sectors_read.sum, are reported at the device level or the SM level.
For example, I see

dram__sectors_read.sum         sector      359020.000000     362896.000000     359714.383838
smsp__thread_inst_executed.sum inst        3318416002.000000 3517973660.000000 3431244663.151515

Since thread_inst_executed is at smsp level, if the dram metric is at device level, I have to calculate NUMBER_OF_SMS * 4 * smsp_thread_inst_executed to see how many instructions were executed on the device with respect to the number of dram read transactions. Is this argument correct?

As a follow up, I checked the transactions with the bytes for a kernel

Metric Name             Metric Unit Minimum       Maximum       Average
----------------------- ----------- ------------- ------------- -------------
dram__bytes_read.sum    Mbyte       11.509888     11.618048     11.534279
dram__bytes_write.sum   Kbyte       816.640000    847.360000    828.806465
dram__sectors_read.sum  sector      359684.000000 363064.000000 360446.222222
dram__sectors_write.sum sector      25520.000000  26480.000000  25900.202020

I expect that bytes/32=transactions. However, for the reads we see that this ratio (bytes/transactions) is slightly more than 32.

11.53 x 1024 x 1024 / 360446 = 33.5

I don’t know if that difference is related to the logic of metric collection (flushes, counts, …) or something else. Any thoughts?

The unit__ prefix states at what unit the counter is observed.
The suffix after unit__metric_name. states how the metric is calculated over the unit instances.

.sum is the sum of the counter across all units of that type
.avg is the average of the counter across all units of that type
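
To make the rollup idea concrete, here is a minimal sketch. The per-instance values are made up for illustration, and the helper names are hypothetical; this only mimics what .sum and .avg mean, not how the profiler computes them internally.

```python
# Hypothetical per-instance counter values, one entry per SMSP instance.
# Real values come from the profiler; these numbers are invented.
per_smsp_thread_inst = [100, 120, 80, 100]


def rollup_sum(values):
    """Mimic the .sum rollup: total of the counter over all unit instances."""
    return sum(values)


def rollup_avg(values):
    """Mimic the .avg rollup: mean of the counter over all unit instances."""
    return sum(values) / len(values)


total = rollup_sum(per_smsp_thread_inst)
mean = rollup_avg(per_smsp_thread_inst)
print(total, mean)
```

The point is that a .sum value already covers every instance of the unit, so it should not be multiplied by the instance count again.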

"Since thread_inst_executed is at smsp level, if the dram metric is at device level, I have to calculate NUMBER_OF_SMS * 4 * smsp_thread_inst_executed to see how many instructions were executed on the device with respect to the number of dram read transactions."

smsp__thread_inst_executed.sum is already rolled up to the device level.

smsp__thread_inst_executed.sum / dram__sectors_read.sum ==> average number of thread instructions executed per DRAM sector read.
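
As a rough sketch, here are both directions of the ratio using the Average column posted in the question. The variable names are just for illustration, and since the two averages are taken over several kernel launches, the result is approximate.

```python
# Average values copied from the profiler output in the question.
dram_sectors_read = 359714.383838        # dram__sectors_read.sum
thread_inst_executed = 3431244663.151515  # smsp__thread_inst_executed.sum

# Both metrics are already .sum rollups over all unit instances,
# so no extra scaling by the number of SMs or SMSPs is applied.
inst_per_sector = thread_inst_executed / dram_sectors_read
sectors_per_inst = dram_sectors_read / thread_inst_executed

print(f"{inst_per_sector:.1f} thread instructions per DRAM sector read")
print(f"{sectors_per_inst:.6e} DRAM sectors read per thread instruction")
```

Keep in mind that the DRAM counter includes traffic from all clients, so this ratio is only a coarse indicator.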

The SM does not know if a memory request reached DRAM. The DRAM controller does not know if a request was from the SM. dram__sectors_read.sum includes all DRAM reads from all clients, including but not limited to SMs (loads, partial stores, icache misses, constant misses), display controller, asynchronous copy engine, CPU via PCIe, NVLink, etc.

Mbyte is 1000 x 1000 bytes
Mibyte is 1024 x 1024 bytes

11.534279 x 1000 x 1000 / 360446.2222 = 32
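
The same check can be scripted. This is a small sketch using the Average column from the table in the question; the variable names are illustrative only.

```python
# Average values copied from the table in the question.
bytes_read_mbyte = 11.534279     # dram__bytes_read.sum, reported in Mbyte
bytes_write_kbyte = 828.806465   # dram__bytes_write.sum, reported in Kbyte
sectors_read = 360446.222222     # dram__sectors_read.sum
sectors_write = 25900.202020     # dram__sectors_write.sum

# Decimal units (Mbyte = 1000 x 1000 bytes, Kbyte = 1000 bytes).
read_bytes_per_sector = bytes_read_mbyte * 1000 * 1000 / sectors_read
write_bytes_per_sector = bytes_write_kbyte * 1000 / sectors_write

# Treating Mbyte as 1024 x 1024 bytes reproduces the ~33.5 from the question.
read_bytes_per_sector_binary = bytes_read_mbyte * 1024 * 1024 / sectors_read

print(f"reads (decimal): {read_bytes_per_sector:.3f} bytes/sector")
print(f"writes (decimal): {write_bytes_per_sector:.3f} bytes/sector")
print(f"reads (binary):  {read_bytes_per_sector_binary:.3f} bytes/sector")
```

With decimal units, both reads and writes come out at about 32 bytes per sector.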

I appreciate the detailed explanations.