I would like to know if the memory stats obtained by nvprof are at warp level or thread level. For example, the description of dram_read_transactions is “Device memory read transactions”. Also, for gld_transactions, I see “Number of global memory load transactions”.
The memory stats are aggregated over the whole kernel; they are not reported per warp or per thread.
dram_read_transactions counts 32-byte sectors read from device memory.
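Since each counted transaction is one 32-byte sector, the counter converts directly to bytes. A minimal sketch of that arithmetic follows; the function names and the kernel-time argument are illustrative assumptions, not part of nvprof.

```python
SECTOR_BYTES = 32  # one dram_read_transactions count = one 32-byte sector


def dram_read_bytes(dram_read_transactions: int) -> int:
    """Total bytes read from device memory over the whole kernel."""
    return dram_read_transactions * SECTOR_BYTES


def dram_read_throughput_gb_per_s(dram_read_transactions: int,
                                  kernel_time_s: float) -> float:
    """Average read throughput in GB/s (1 GB = 1e9 bytes).

    kernel_time_s is the kernel duration in seconds, taken from the
    profiler's timing output (an input assumed for this sketch).
    """
    return dram_read_bytes(dram_read_transactions) / kernel_time_s / 1e9
```

For example, 1,000 transactions correspond to 32,000 bytes read from device memory.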
gld_transactions varies with architecture:
- On the Maxwell and Pascal architectures (CC 5.* - 6.*), the counter counts request packets sent from the SM to L1TEX. For a full warp of 32 threads:
  - a load of 32 bits or less is 4 transactions of 8 threads each
  - a 64-bit load is 8 transactions of 4 threads each
  - a 128-bit load is 16 transactions of 2 threads each
  - predicated-off or inactive threads do not generate transactions
- On the Volta and Turing architectures (CC 7.*), the counter counts 32-byte sectors from L1TEX (both hits and misses).
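The two counting rules can be sketched for a single warp-level load as follows. This is a rough model under stated assumptions, not a description of the hardware: it assumes threads are grouped by consecutive lane index, that a group produces a transaction only if at least one of its lanes is active, and that the Volta/Turing counter sees each distinct 32-byte sector touched by the warp once. All function names are hypothetical.

```python
WARP_SIZE = 32
SECTOR_BYTES = 32


def threads_per_transaction(access_bits: int) -> int:
    """Threads covered by one SM-to-L1TEX request packet (Maxwell/Pascal)."""
    if access_bits <= 32:
        return 8
    if access_bits == 64:
        return 4
    if access_bits == 128:
        return 2
    raise ValueError("access size must be <= 32, 64, or 128 bits")


def gld_transactions_maxwell_pascal(access_bits: int,
                                    active_mask: int = 0xFFFFFFFF) -> int:
    """Estimated transactions for one warp-level load on CC 5.* - 6.*.

    Assumption: lanes are grouped consecutively, and a group whose lanes
    are all inactive or predicated off generates no transaction.
    """
    group = threads_per_transaction(access_bits)
    count = 0
    for first_lane in range(0, WARP_SIZE, group):
        group_mask = ((1 << group) - 1) << first_lane
        if active_mask & group_mask:
            count += 1
    return count


def sectors_touched(addresses, access_bytes: int) -> int:
    """Distinct 32-byte sectors touched by the active threads of one warp.

    Assumption: on CC 7.* each distinct sector is counted once per
    warp-level load. `addresses` holds the byte address of each active
    thread; accesses are at most 16 bytes, so each spans at most 2 sectors.
    """
    sectors = set()
    for addr in addresses:
        sectors.add(addr // SECTOR_BYTES)
        sectors.add((addr + access_bytes - 1) // SECTOR_BYTES)
    return len(sectors)
```

Under this model, a fully active warp doing a coalesced 32-bit load gives 4 transactions on Maxwell/Pascal and touches 4 sectors on Volta/Turing, while the same load with a 32-byte stride between threads still gives 4 transactions on Maxwell/Pascal but touches 32 sectors.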