I also wrote this kernel, so it makes sense to me that it's heavy on shared memory reads. What I don't understand is why it has such a small l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed. To be honest, I don't really understand what this metric measures.
Greg also mentioned this metric in another post, but he didn't give its definition.
So I guess my question is: what does l1tex__data_bank_reads measure, and how is it different from l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum? If l1tex__data_bank_reads is the sum of shared memory reads and L1 cache reads across the 32 data banks, why is it so low when this kernel is bottlenecked on shared memory reads?
Yeah, LDGSTS is used and the kernel is pretty heavy on global reads as well, but I don't think that's the bottleneck, given that the L1TEX LSUIN is the first to reach its maximum throughput.
l1tex__data_bank_reads.avg is the number of data banks (0-32) read during global, local, texture, surface, or shared reads.
// CASE 1: Every thread will access consecutive 32-bits
__shared__ uint32_t data[1024];
uint32_t value = data[threadIdx.x]; // assume all threads are active and predicated on
// - l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld = 1 per warp
// - l1tex__data_bank_reads = 32 per warp
// CASE 2: Every thread accesses the same data
__shared__ uint32_t data[1024];
uint32_t value = data[0]; // assume all threads are active and predicated on
// - l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld = 1 per warp
// - l1tex__data_bank_reads = 1 per warp
// CASE 3: Every thread accesses bank 0 on a different line, resulting in 32 wavefronts
__shared__ uint32_t data[1024];
uint32_t value = data[threadIdx.x * 32]; // assume all threads are active and predicated on
// - l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld = 32 per warp
// - l1tex__data_bank_reads = 32 per warp (1 per wavefront)
On some GPUs more than 32 banks can be accessed per cycle if multiple clients are accessing the SRAM (e.g. global/local/texture/surface and shared).
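If you want to compare both counters directly on your kernel, they can be collected together with the Nsight Compute CLI (./my_kernel_app is a placeholder for your binary):

```shell
ncu --metrics l1tex__data_bank_reads.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum ./my_kernel_app
```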